(54) Method and apparatus for storage device management



(57) Disclosed is a volume managing system for 
computer storage devices. Physical volumes 
are logically partitioned, with multiple copies of 
data being maintained for system recovery pur- 
poses. A scheme for monitoring, updating, and 
recovering data in the event of data errors is 
achieved by maintaining a volume group status 
area on each physical volume, the status area 
reflecting status for all physical volumes de- 
fined for a given volume group. Updates to this 
status area occur serially, thereby protecting 
against all volumes becoming corrupted at 
once. A method of updating subsequent status 
changes, while the first status change is still in
progress, provides for improved system
throughput.
This invention relates in general to data processing
methods for use in data processing systems for
managing physical storage space on a storage device 
and in particular to an improved method for maintain- 
ing redundant data on these storage devices. 

The prior art discloses a number of data proces- 
sing systems which employ disk storage devices for 
storing data employed by the system. These devices 
store various types of information such as the operat- 
ing system under which the microprocessor operates, 
different application programs that are run by the sys- 
tem and information that is created and manipulated 
by the various application programs. 

Disk storage devices have generally comprised 
one or more magnetic or optical disks having a 
plurality of concentric tracks which are divided into 
sectors or blocks. Each surface of a disk generally 
stores information and disk drives are configured with 
multiple disks and multiple heads to permit one 
access mechanism to position the heads to one of 
several concentric recording tracks. Most current disk 
drives employ an addressing convention that speci- 
fies a physical storage location by the number of the 
cylinder (CC), the number of the magnetic head (H) 
and the sector number (S). The number of the cylinder 
is also the number of the tracks where multiple heads 
are employed and the head number is equivalent to 
the disk surface in a multi-disk configuration. The
"CCHS" addressing format is employed independent
of the capacity of the disk file since it is capable of
addressing any configuration that may exist.

The capacity of disk storage devices measured in 
terms of bytes is dependent on the recording technol- 
ogy employed, the track density, disk size and the 
number of disks. As a result disk drives are manufac- 
tured in various capacities, data rates and access 
times. 

Most data processing systems generally employ 
a number of disk drives for storing data. Since each 
device is a failure independent unit, it is sometimes 
advantageous to spread the data to be stored over a 
number of smaller capacity drives rather than having 
one large capacity device. This configuration permits 
a copy of critical data to be stored in a separate device 
which can be accessed if the primary copy is not avail- 
able. 

A concept known as mirroring can be used to sup- 
port replication and recovery from media failures, as 
is described in Mirroring of Data on a partition basis 
(31514), Research Disclosure # 315, July 1990, wherein
data is mirrored or replicated on a partition (physically
contiguous collection of bytes) basis. This
provides even greater flexibility in backup and recov- 
ery of critical data because of its finer granularity in 
defining storage areas to be duplicated. 

The task of allocating disk storage space in the 
system is generally the responsibility of the operating 
system. Unix (Trademark of UNIX System
Laboratories, Inc.) type operating systems such as the
IBM AIX (Trademark of IBM) operating system, which
is employed on the IBM RISC System/6000
(Trademark of IBM) engineering workstation, have a
highly developed system for organizing files. In Unix
parlance a "file" is the basic structure that is used for
storing information that is employed in the system. For
example a file may be a directory, which is merely a
listing of other files in the system, or a data file. Each
file must have a unique identifier. A user assigns a
name to a file and the operating system assigns an
inode number, and a table is kept to translate names
to numbers. A file name is merely a sequence of
characters. Files may be organized by assigning
related files to the same directory, which characteristically
is another file with a name and which merely
lists the name and inode number of the files stored in
that directory.

The AIX operating system also organizes file
directories in groups which are given a file name since
they are also considered to be a file. The resultant
organization is known as a hierarchical file system
which resembles an inverted tree structure with the
root directory at the top and a multi-level branching
structure descending from the root. Both directories
and non-directory type files can be stored at each
level. Files that are listed by name in a directory at one
level are located at the next lower level. A file is identified
in the hierarchical file system by specifying its
name preceded by the description of the path that is
traced from the root level to the named file. The path
descriptor is in terms of the directory names through
which the path descends. If the current directory is the
root directory the full path is expressed. If the current
directory is some intermediate directory, the path description
may be shortened to define the shorter path.

The various files of the operating system are 
themselves organized in a hierarchical file system. 
For example a number of subdirectories descend
from the root directory and list files that are related.
The subdirectories have names such as / which
stores the AIX kernel files; /bin which stores the AIX
utilities; /tmp which stores temporary files; and /u
which stores the users' files.

As indicated previously the task of assigning AIX
files to specific addressable storage units on the disk
drive is the responsibility of the operating system.
Prior to actually assigning a file to disk blocks, a determination
is made to divide the available disk storage
space of the storage subsystem into a number of different
areas so each area can store files having the
same general function. These assigned areas are
often referred to as virtual disks or logical volumes.
The term mini-disk is used in the IBM RT system and
the term A-disk in IBM's VM system. The term logical
volume is used on IBM's AIX system.

Several advantages are obtained from the 
standpoint of management and control when files 
having the same characteristics are stored in one
defined area of the disk drive. For example, a certain
group of files may not be changed at all over a certain
period of time while others may change quite rapidly so
that they would be backed up at different times. It is
also simpler for the administrator to assign these files 
to a virtual disk or logical volume in accordance with 
their function and manage all the files in one group the 
same. These are just two examples of many where 
the provision of virtual disks/logical volumes simplifies 
the administration and control by the operating sys- 
tem of the storage of files in the storage subsystem. 

Conventional methods for protecting the data 
integrity in data processing systems are not efficient 
in a logical volume environment. What an end user 
perceives to be a single volume of data could actually 
have data spread across numerous physical volumes. 
US-A- 4,507,751 describes a conventional log write 
ahead method used in a conventional database. 

Methods for extending data integrity to a virtual
disk system are shown in US-A- 4,498,145, US-A-
4,945,474 and US-A- 4,930,128. However
these methods introduce considerable overhead in
maintaining error logs and recovery procedures.
These systems are further limited in that a fault or
error in the data merely results in the data being restored
to an older version of the data, thereby losing
the updated data.

Other methods provided data redundancy, where 
an old and new copy of data is maintained. Once the 
new copy of data is verified to be valid, it becomes the 
old copy and what was once old copy can now be 
overwritten with new data. Thus, the old and new 
copies ping-pong back and forth in their roles of hav- 
ing old or new data. As the number of physical data 
volumes is increased under this method, severe over- 
head impacts system performance and throughput in 
maintaining this technique. 

It is thus desirable to provide for a data proces- 
sing system which has a virtual disk/logical volume 
data system with data redundancy for error recovery 
that has a minimal impact on system performance. 

Accordingly, the present invention provides a
method for managing a plurality of data storage 
devices associated with a computer system and hav- 
ing a first physical volume and subsequent physical 
volumes and being partitioned into one or more logical 
volumes, each of said logical volumes being further 
partitioned into one or more logical partitions each of 
which comprises one or more physical partitions of 
said storage devices, said method comprising the 
steps of: 

determining status information for each of said 
physical partitions and recording said status infor- 
mation in a memory of said computer system; 

recording said status information in a status 
area existing on each of said data storage devices; 

creating updated status information when a 
write request is generated for any of said physical partitions;

updating said status area on said first physical 
volume with said updated status information; and 

updating said status area of each subsequent
physical volume within said storage devices in succession
with said updated status information, wherein
if a second or subsequent write request is received
prior to completing an update of each of said storage
device status areas as a result of a prior write request,
said status information is updated in said computer
memory and used in updating said next succeeding
physical volume status area.

The present invention also provides a computer
system including means for managing a plurality of
data storage devices associated with said computer
system and having a first physical volume and subsequent
physical volumes and being partitioned into one
or more logical volumes, each of said logical volumes
being further partitioned into one or more logical partitions
each of which comprises one or more physical
partitions of said data storage devices, said managing
means comprising:

means for maintaining status information for
each of said physical partitions in a memory of said
computer system;

recording means for recording said status
information in a status area existing on each of said
data storage devices;

means for creating updated status information
when a write request is generated for any of said
physical partitions;

first update means for updating said status
area on said first physical volume with said updated
status information; and

subsequent update means for updating said
status area of each subsequent physical volume
within said data storage devices in succession with
said updated status information, wherein if a second
or subsequent write request is received prior to completing
an update of each of said data storage device
status areas as a result of a prior write request, said
status information is updated in said computer memory
and used in updating said next succeeding physical
volume status area.

The present invention is directed to the aforementioned
performance problems which are introduced in
a system which maintains multiple copies of data. In
accordance with the new data processing method, a
physical partition comprising a plurality of physically
contiguous disk blocks or sectors is established as
the basic unit of space allocation, while the disk block
is kept as the basic unit of addressability of the disk
file. A plurality of physical partitions are grouped
together and called a physical volume. A plurality of
physical volumes that are grouped together is referred
to as a volume group. The number of physical
blocks contained in each physical partition and the
number of physical partitions in each physical volume
is fixed when the physical volume is installed into the
volume group. Stated differently, all physical partitions
in a physical volume group are the same size.
Different volume groups may have different partition
sizes.

When an AIX file system, i.e., a group of related 
files, is to be installed on the system, a logical volume 
is created which includes only the minimum number 
of physical partitions on the disk required to store the 
file system. As more storage space is needed by the
file system, the logical volume manager allocates an 
additional physical partition to the logical volume. The 
individual physical partitions of the logical volume 
may be on different disk drives. 

A partition map is maintained by the logical
volume manager which specifies the physical address
of the beginning of each physical partition in terms of
its device address and block number on the device, to
assist in correlating logical addresses provided by the
system to real addresses on the disk file.
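
The role of the partition map can be illustrated by the following sketch. It is not taken from the code of Fig. 21; the names pp_entry, disk_addr and map_logical_block are hypothetical, and the sketch assumes that the partitions of a logical volume are listed in the map in logical order and that every partition holds the same number of blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partition-map entry: where one physical partition begins. */
struct pp_entry {
    int  device;        /* device address of the physical volume */
    long start_block;   /* block number of the partition's first block */
};

/* Hypothetical result type: a real address on a disk file. */
struct disk_addr {
    int  device;
    long block;
};

/* Translate a logical block number into a device address and block number.
 * blocks_per_pp is assumed to be fixed for the whole volume group. */
struct disk_addr map_logical_block(const struct pp_entry *map, size_t n_entries,
                                   long blocks_per_pp, long logical_block)
{
    assert(blocks_per_pp > 0 && logical_block >= 0);

    size_t index  = (size_t)(logical_block / blocks_per_pp); /* which partition */
    long   offset = logical_block % blocks_per_pp;           /* block within it */
    assert(index < n_entries);

    struct disk_addr addr = { map[index].device,
                              map[index].start_block + offset };
    return addr;
}
```

Because each entry carries its own device address, consecutive partitions of one logical volume may resolve to different disk drives.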

Data being stored within the system can be mirrored,
where redundant copies of the data are stored
in separate physical partitions. Mirroring is achieved 
by adding an additional structuring mechanism be- 
tween the logical volume and the physical partitions 
therewithin. The logical volume is instead made up of 
logical partitions, which function identically to the 
physical partitions previously discussed. These logi- 
cal partitions are then made up of one or more physi- 
cal partitions. When more than one physical partition 
is associated with a logical partition, the logical parti- 
tion is said to be mirrored. When a logical partition is 
mirrored then a request to read from a logical partition 
may read from any physical partition. These multiple,
physical partitions are where the redundant copies of 
data are stored. Thus, a logical partition's data can be 
mirrored on any number of physical partitions 
associated therewith. 

When a write request for a logical volume is 
received, data on all physical copies of the logical par- 
tition, i.e. all copies of the physical partitions, must be 
written before the write request can be returned to the 
caller or requestor. Whenever data on a logical 
volume is updated or written to, it is possible through 
system malfunctions or physical volume unavailability 
that any particular physical copy will have a write fail- 
ure. This failure causes the data in this particular 
physical copy to be incorrect, and out of synchroni- 
zation with other copies of the same data. When this 
occurs, this physical copy is said to be stale and cannot
be used to satisfy a subsequent read request.
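
The stale-copy rule just described may be sketched as follows. The structure and function names are hypothetical, the limit of three copies is an assumption made only for the example, and the sketch assumes that a copy, once stale, remains stale until some resynchronization step that is not shown.

```c
#include <stdbool.h>

#define MAX_COPIES 3   /* assumed upper bound on mirror copies */

/* Hypothetical logical partition backed by up to MAX_COPIES physical copies. */
struct logical_partition {
    int  n_copies;
    bool stale[MAX_COPIES];   /* true once a write to that copy has failed */
};

/* Record the outcome of one mirrored write. write_ok[i] tells whether
 * physical copy i was written successfully. A failed copy is marked stale.
 * Returns the number of copies that are still usable. */
int record_write(struct logical_partition *lp, const bool write_ok[])
{
    int fresh = 0;
    for (int i = 0; i < lp->n_copies; i++) {
        if (!write_ok[i])
            lp->stale[i] = true;
        if (!lp->stale[i])
            fresh++;
    }
    return fresh;
}

/* Choose a copy for a read request: the first copy that is not stale.
 * Returns -1 when every copy is stale. */
int pick_read_copy(const struct logical_partition *lp)
{
    for (int i = 0; i < lp->n_copies; i++)
        if (!lp->stale[i])
            return i;
    return -1;
}
```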

Status information regarding stale data must be 
stored on permanent storage so that the information 
is maintained through system crashes/reboots or 
power interruption. This stale information is stored in
a Status Area (VGSA) that is written to all active physical
volumes in the volume group. With mirroring, this
volume group is a collection of physical volumes
where the physical partitions may be allocated to
make up the logical partitions of the logical volume.
Each physical volume contains a copy of the VGSA so
that any physical volume can be used to determine
the state of any physical partition allocated to any logical
volume in the volume group. A volume group may
contain many physical volumes, and a change in the
state of any physical partition will result in the updating
of the VGSAs on each physical volume. In the present
preferred embodiment, there is a limit of 32
physical volumes per volume group.

When a physical partition goes stale due to a
write request that has some failure, the originating
request must wait for all VGSAs to be updated with the
new state information before being allowed to be
returned to the caller. If the first request is actively
updating VGSAs and a second request requires a
VGSA update, it must wait until the first is completed,
causing degradation in system performance. For
example, in a worst case scenario, where the second
request immediately followed the first request, the first
request would take N x Q time (where N is the number
of VGSAs to update and Q is the time required per
VGSA) and the second request would similarly take N
x Q time, resulting in a delay of 2N x Q for the second
request to return to the originator.

One possible solution is to write the VGSAs in
parallel. However, this allows for the possibility of
losing a majority of the VGSAs due to a catastrophic
system failure, such as a power outage in the
middle of the write, which could potentially corrupt all
VGSAs and therefore lose the stale status information
for all the physical partitions. Therefore, the
VGSAs must be written serially to prevent this potential
loss.

The present invention addresses this problem of
system degradation, when updating multiple VGSAs
serially, by using a concept hereinafter called the
Wheel. The Wheel maintains and updates the VGSAs
on all physical volumes in the volume group for a
given request. The Wheel accepts requests, modifies
a memory version of the VGSA as per that request,
initiates the VGSA writes, and when all VGSAs have
been updated for that request, finally returns that
request to its originator. The Wheel also ensures that
the request will not be held up longer than the time it
takes to write N + 1 VGSAs (again, where N is the
number of VGSAs and physical volumes in the
volume group), as opposed to other methods which
could take as long as the time it takes to write 2N
VGSAs.
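
The N + 1 bound can be illustrated by a small model, offered only as a sketch and not as the code of Fig. 21. The model assumes that a VGSA write already in flight when a request arrives carries the older memory image, so that volume must be written once more when the Wheel comes around again. The names wheel, wheel_init and wheel_submit are hypothetical, and a version counter stands in for the real VGSA contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PV 32   /* limit of physical volumes per volume group */

/* Hypothetical model of the Wheel. */
struct wheel {
    int      n;                 /* number of VGSAs, one per physical volume */
    int      next;              /* index of the next VGSA to be written */
    unsigned mem_seq;           /* version of the memory copy of the VGSA */
    unsigned disk_seq[MAX_PV];  /* version last written to each VGSA */
};

void wheel_init(struct wheel *w, int n)
{
    assert(n > 0 && n <= MAX_PV);
    w->n = n;
    w->next = 0;
    w->mem_seq = 0;
    memset(w->disk_seq, 0, sizeof w->disk_seq);
}

/* True when every VGSA on disk holds at least version `want`. */
static bool all_current(const struct wheel *w, unsigned want)
{
    for (int i = 0; i < w->n; i++)
        if (w->disk_seq[i] < want)
            return false;
    return true;
}

/* Submit one request and count the serial VGSA writes it waits for.
 * write_in_flight models a request that arrives while the Wheel is
 * part-way through writing one VGSA on behalf of an earlier request. */
int wheel_submit(struct wheel *w, bool write_in_flight)
{
    unsigned in_flight_seq = w->mem_seq;  /* image taken before this request */
    unsigned want = ++w->mem_seq;         /* request modifies the memory copy */
    int writes = 0;

    if (write_in_flight) {                /* finish the write with the old image */
        w->disk_seq[w->next] = in_flight_seq;
        w->next = (w->next + 1) % w->n;
        writes++;
    }
    while (!all_current(w, want)) {       /* then continue around the Wheel */
        w->disk_seq[w->next] = w->mem_seq;
        w->next = (w->next + 1) % w->n;
        writes++;
    }
    return writes;
}
```

In this model a request that finds the Wheel idle waits for N writes, and a request that arrives during a write waits for N + 1, since the Wheel does not restart from the first volume but simply carries the newer memory image onward from its current position.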

EP 0 482 853 A2

In order that the invention will be fully understood,
preferred embodiments thereof will now be described,
by way of example only, with reference to the accompanying
drawings in which:

Fig. 1 is a functional block diagram of a data processing
system in which the method of the present
invention may be advantageously employed;
Fig. 2 is a diagrammatic illustration of the hierarchical
file system organization of the files containing
the information to be stored on the system
shown in Fig. 1;
Fig. 3 is a diagrammatic illustration of a disk file
storage device shown functionally in Fig. 1;
Fig. 4 is a diagram illustrating the physical relationships
of various physical storage components
employed in the real addressing architecture of a
disk file;
Fig. 5 illustrates the general layout of a Physical
Volume;
Fig. 6 illustrates the general layout of the Logical
Volume Manager Area;
Fig. 7 illustrates the details of the Volume Group
Status Area Structure;
Fig. 8 illustrates the details of the Volume Group
Data Area Structure;
Fig. 9 illustrates the layout of a Logical Volume;
(There is no Fig. 10, 11 or Fig. 12.)
Fig. 13 illustrates the system relationship to a logical
volume manager pseudo-device driver;
Fig. 14 illustrates the interrelationship between
logical volumes, logical partitions, and physical
partitions;
Fig. 15 illustrates the interrelationship between
physical partitions, physical volumes, and volume
groups;
Fig. 16 illustrates the Volume Group Status Area
WHEEL concept;
Fig. 17 illustrates the PBUF data structure;
Figs. 17a and 17b illustrate the PBUF data structure
elements;
Fig. 18 illustrates the Logical Volume Device
Driver Scheduler initial request policy;
Fig. 19 illustrates the Logical Volume Device
Driver Scheduler post request policy;
Fig. 20 illustrates the Logical Volume Device
Driver Volume Group Status Area processing;
and
Fig. 21 illustrates the code used to implement the
WHEEL function. The code has been written in
the C programming language.

Fig. 1 illustrates functionally a typical data processing
system 10 which embodies the method of the
present invention for managing storage space. As
shown in Fig. 1, the system hardware 10 comprises a
microprocessor 12, a memory manager unit 13, a
main system memory 14, an I/O channel controller 16
and an I/O bus 21. A number of different functional I/O
units are shown connected to bus 21 including the
disk drive 17. The information that is stored in the system
is shown functionally by block 11 in Fig. 1 and
comprises generally a number of application programs
22 and the operating system kernel 24, which in this
instance may be assumed to be the AIX operating
system. Also shown is a group of application development
programs 23 which may be tools used by program
development personnel during the process of
developing other programs.

An example of a commercial system represented
by Fig. 1 is the IBM RISC System/6000 engineering
workstation which employs the AIX operating system.
The AIX operating system is a Unix type operating
system and employs many of its features including
system calls and file organization.

Fig. 2 illustrates the file organization structure of
the AIX operating system. The basic unit of information
stored is termed a "file." Each file has a name
such as "my_file.001". Files may be grouped together
and a list generated of all file names in the group. The
list is called a directory and is per se a file, with a name
such as "my_direct.010". The organization shown in
Fig. 2 is called an inverted tree structure since the root
of the file organization is at the top. The root level of
the organization may contain directory files and other
type files. As shown in Fig. 2, a root directory file lists
the names of other files 00A, 00B, 00C, 00D, and 00E.
The files listed in a directory file at one level appear
as files at the next lower level. The file name includes
a user assigned name and a path definition. The path
definition begins at the root directory which, by convention,
is specified by a "slash character" (/) followed
by the file name or the directory name that is in the
path that must be traced to reach the named file.

Each of the program areas shown in block 11 in
Fig. 1 includes a large number of individual files which
are organized in the manner shown in Fig. 2. The term
"File System" is used to identify a group of files that
share a common multi-level path or a portion of their
respective multi-level paths.

The method of the present invention functions to
manage storage space on the disk drive 17 shown in
Fig. 1 for all of the files represented in block 11 of Fig.
1 and the files that would be represented on the
hierarchical storage system shown in Fig. 2.

The disk drive 17 in practice may comprise a
plurality of individual disk drives. One such device is
shown diagrammatically in Fig. 3. The device as
shown in Fig. 3 comprises a plurality of circular magnetic
disks 30 which are mounted on a shaft 31 which
is rotated at a constant speed by motor 32. Each surface
33 and 34 of the disk 30 is coated with magnetic
material and has a plurality of concentric magnetic
tracks. Other embodiments would have disk 30
coated with material to allow optical storage of data.

The disk drive 17 further includes a mechanism
35 for positioning a plurality of transducers 36, one of
each being associated with one surface, conjointly to
one of the concentric recording track positions in
response to an address signal 36 supplied to actuator
37 attached to move carriage 38. One recording track
on each surface of each disk belongs to an imaginary
cylinder of recording tracks that exist at each track
position.

The physical address to the disk drive takes the
form of a five byte address designated "CCHS" where
CC represents the cylinder or track number, H repre- 
sents the number assigned to the magnetic head or 
transducer which also corresponds to the disk surface 
since there is one head per surface, and S represents 
the sector or block number of a portion of the track. 
The block is established as the smallest unit of data 
that can be addressed on the device. Other embodi- 
ments could support other physical head to disk con- 
figurations and still be within the scope of this 
invention. For example, instead of a single head or 
transducer corresponding to each disk surface, multiple
heads or transducers might be utilized to reduce
the seek time required to attain a desired track loca- 
tion. 

From a programming standpoint a disk drive is 
sometimes referred to as a Physical Volume (PV) and 
is viewed as a sequence of disk blocks. A Physical 
Volume has one device address and cannot include 
two separate disk devices since each device has a 
separate accessing mechanism and requires a 
unique address. 
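
The relationship between the "CCHS" address and the view of a Physical Volume as a sequence of disk blocks can be sketched as below. The geometry structure and the function cchs_to_block are hypothetical, and the sketch assumes that cylinders, heads and sectors are all numbered from zero and that blocks are counted cylinder by cylinder, then surface by surface within a cylinder.

```c
#include <assert.h>

/* Hypothetical drive geometry. */
struct geometry {
    long heads;               /* recording surfaces, one head per surface */
    long sectors_per_track;   /* blocks on each track */
};

/* Convert a cylinder (CC), head (H) and sector (S) number, all counted
 * from zero, to a position in the linear sequence of disk blocks. */
long cchs_to_block(const struct geometry *g, long cylinder, long head, long sector)
{
    assert(cylinder >= 0);
    assert(head >= 0 && head < g->heads);
    assert(sector >= 0 && sector < g->sectors_per_track);

    return (cylinder * g->heads + head) * g->sectors_per_track + sector;
}
```

For a drive assumed to have 4 heads and 17 sectors per track, cylinder 2, head 1, sector 5 corresponds to block (2 x 4 + 1) x 17 + 5 = 158 of the sequence.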

Fig. 4 illustrates the physical relationship of the 
various storage elements involved in the addressing 
architecture of a disk drive which to a large extent is 
generally standardized in the industry. 

Each byte position 40 stores one byte of data. The 
sector or block 41 comprises a specified plurality of 
sequential or contiguous byte positions generally 512 
and is the lowest level of an addressable element. 
Sectors or blocks 41 are combined into tracks 42,
which are combined into surfaces 33 and 34, which
are combined into disks 31, 32 ..., which are combined
into disk drives or disk storage devices 17 of
Fig. 1. If more than one disk storage device 17 is
employed the combination of two or more devices is
referred to as a physical string of disk drives or disk
files. In practice a disk or a disk track 42 may contain
one or more sectors 41 having a number of defects
sufficient to render the block unusable.

The layout of a Physical Volume is shown in Fig. 
5. Each physical volume, for example each separate 
disk drive, reserves an area of the volume for storing
information that is used by the system when the power 
is first turned on. This is now a standard convention 
in the industry where, for example, tracks or cylinders 
0-4 are reserved for special information. 

Each physical volume reserves at least two cylin- 
ders for special use. The Boot Code, which may be 
used to load diagnostics software, or the kernel of the 
Operating System, is held in a normal logical volume
and no longer requires a special physical volume 
location. 

The first reserved cylinder is cylinder 0, the first 
cylinder on any physical volume. Each physical 
volume uses the first four tracks of cylinder 0 to store 
various types of configuration and operation information
about the Direct Access Storage Devices
(DASD) that are attached to the system. Some of this
information is placed on the cylinder by the physical
volume manufacturer, and some of it is written by the
operating system on the first 4 tracks of cylinder 0.

The second reserved cylinder on the physical 
volume is for the exclusive use of the Customer 
Engineer and is called the CE cylinder. This is always 
the last cylinder on the physical volume and is used
for diagnostic purposes. The CE cylinder cannot be
used for user data. The Boot Code area and the Non-Reserved
area are pointed to by the contents of an
IPL Record interpreted in the context of the contents
of a Configuration Record.

The Initial Program Load (IPL) Record consisting
of one block contains information that allows the system
to read the Boot Code (if any) and initialize the
physical volume. The IPL Record can be divided into
four logical sections: The first section is the IPL
Record ID. The second section contains format information
about the physical volume. The third section
contains information about where the Boot Code (if
any) is located and its length. The fourth section contains
information about where the non-reserved area
of the physical volume is located and its length.

One track is also reserved for the Power On Sys- 
tem Test (POST) control block that is created in mem- 
ory during system initialization. 

The first part of the non-reserved area of a physical
volume contains a Logical Volume Manager Area.
The invention hereinafter disclosed is primarily concerned
with the management of this Logical Volume
Manager Area. Fig. 6 is an exploded view of the Logical
Volume Manager Area, which has a Volume
Group Status Area and Volume Group Data Area.
Secondary copies of these areas may also
immediately follow the primary copies. To save space
on the physical volumes, the size of this Logical
Volume Manager Area is variable. It is dependent on
the size of the physical volume and the number of logical
volumes allowed in the volume group.

As previously mentioned, each physical volume
contains a Volume Group Status Area (VGSA). The
Status Area indicates the state of each physical partition
on the physical volume. Every physical volume
within a volume group contains an identical copy of
the Status Area. The Status Area can be duplicated
on the same physical volume, is not contained within
any physical partition, and has the format shown in
Fig. 7. The Status Area should be allocated on DASD
in such a way as to reduce the probability of a single
failure obliterating both copies of it.

The details of the Status Area are shown in Fig.
7. The various fields within the Status Area are interpreted
as follows:

BEGINNING_TIMESTAMP and ENDING_TIMESTAMP
are used when the VG is varied on to validate
the VGSA and control the recovery of the most recent
VGSA. Each timestamp is 8 bytes long. The recovery
and validation process will be discussed later.

PARTITION_STATE_FLAGS occupy the remainder
of the VGSA. The flags are evenly divided among the
maximum of 32 PVs in a VG. That means each PV has
127 bytes of state flags. That leaves 24 bytes of the
4098 in each VGSA unused. It also limits the number
of PPs on any PV to 127 x 8 or 1016 partitions. This
should not restrict the use of any portion of any disk
since the size of the partitions is not a factor, only the
quantity.
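
The arithmetic of the state flags may be illustrated with the following sketch, which covers only the flag portion of the VGSA and omits the timestamps. The figures of 32 PVs and 127 bytes per PV come from the description above; the structure and function names, the choice of a set bit to mean stale, and the ordering of bits within a byte are assumptions made for the example and are not taken from Fig. 7.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PVS            32                       /* PVs per volume group */
#define FLAG_BYTES_PER_PV  127                      /* state-flag bytes per PV */
#define MAX_PPS_PER_PV     (FLAG_BYTES_PER_PV * 8)  /* 1016 partitions per PV */

/* Hypothetical memory image of the partition state flags. */
struct vgsa_flags {
    unsigned char bits[MAX_PVS][FLAG_BYTES_PER_PV];
};

/* Mark physical partition pp of physical volume pv as stale. */
void mark_stale(struct vgsa_flags *v, int pv, int pp)
{
    assert(pv >= 0 && pv < MAX_PVS);
    assert(pp >= 0 && pp < MAX_PPS_PER_PV);
    v->bits[pv][pp / 8] |= (unsigned char)(1u << (pp % 8));
}

/* Report whether physical partition pp of physical volume pv is stale. */
bool is_stale(const struct vgsa_flags *v, int pv, int pp)
{
    assert(pv >= 0 && pv < MAX_PVS);
    assert(pp >= 0 && pp < MAX_PPS_PER_PV);
    return ((v->bits[pv][pp / 8] >> (pp % 8)) & 1u) != 0;
}
```

With one bit per partition, the last addressable partition on a PV is number 1015, which falls in bit 7 of byte 126 of that PV's flags.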

As previously mentioned, each physical volume 
contains a Volume Group Data Area (VGDA). The
Data Area indicates the interrelationship between the 
logical and physical volumes and physical partitions. 
The Data Area can be duplicated on the same physi- 
cal volume, is not contained within any physical par- 
tition, and has the format shown in Fig. 7. The Data 
Area should be allocated on DASD in such a way as
to reduce the probability of a single failure obliterating
both copies of it. The details of the Data Area are
shown in Fig. 8. The various fields within the Data 
Area are described at pages 8 and 9 of Fig. 21. The 
VGDA is a variable sized object that is user defined 
when the Volume Group is created. 

Referring again to Fig. 5, a User Area follows the 
Logical Volume Manager Area, and contains the nor- 
mal user data area. 

A bad block pool area in Fig. 5 is also provided 
which supplies substitution blocks for user area 
blocks that have been diagnosed as unusable. It will 
be assumed in the remaining description that there 
are no bad blocks on the disk or if there are they are 
handled by any of the well known prior art techniques. 

Fig. 9 indicates the layout of a logical volume 
where block numbers are decimal. Logical partition 
size shown is 64 Kilobytes (128 logical blocks). 

In the preferred embodiment, the method of the 
present invention is implemented by a file 
named /dev/lum which is called the Logical Volume 
Manager. 

The Logical Volume Manager (LVM) provides the 
ability to create, modify and query logical volumes, 
physical volumes and volume groups. The LVM auto- 
matically expands logical volumes to the minimum 
size specified, dynamically as more space is needed. 
Logical volumes can span physical volumes in the 
same volume group and can be mirrored for high 
reliability, availability, and performance. Logical 
volumes, volume groups and physical volumes all 
have IDs that uniquely identify them from any other 
device of their type on any system. 

The LVM comprises a number of operations
performed by calls to the SYSCONFIG system call.
These SYSCONFIG calls include the processes for
creating and maintaining internal data structures
having volume status information contained therein.
These system calls are more fully described in the
IBM manual "AIX Version 3 for RISC System/6000,
Calls and Subroutines" Reference Manual: Base
Operating System, Vol. 2.

A Logical Volume Manager pseudo device driver
64 is shown in Fig. 13, and consists of three conceptual
layers. A strategy layer 65 interfaces to the file
system I/O requests 68, a scheduler layer 66 to be
described later, and a physical layer 67 which interfaces
to the normal system disk device drivers 69,
both logical and physical. This pseudo device driver
64 intercepts file system I/O requests 68 destined to
and from the disk device drivers 69 and performs the
functions of mirroring, stale partition processing,
Status Area management, and Mirror Write Consistency,
all of whose operations and functions will now
be described.

Mirroring 

Mirroring is used to support replication of data for
recovery from media failures. Normally, users have
specific files or filesystems that are essential and the
loss of which would be disastrous. Supporting mirroring
only on a complete disk basis can waste a considerable
amount of disk space and result in more
overhead than is needed.

A partition is a fixed sized, physically contiguous
collection of bytes on a single disk. Referring to Figure
14, a logical volume 70 is a dynamically expandable
logical disk made up of one or more logical partitions
71. Each logical partition is backed up by one or more
physical partitions, such as 72, 74 and 76. The logical
partition is backed up by one (72) if the partition is not
mirrored, by two (72 and 74) if the partition is singly
mirrored, and by three (72, 74, and 76) if the partition
is doubly mirrored.

Mirroring can be selected in the following ways for
each logical volume: (i) none of the logical partitions
in a logical volume can be mirrored, (ii) all of the logical
partitions in a logical volume can be mirrored, or
(iii) selected logical partitions in a logical volume can
be mirrored.

Stale Partition Processing 

In order for mirroring to function properly, a
method is required to detect when all physical partition
copies of the mirrored data are not the same. The
detection of stale physical partitions (PP) and initiation
of stale physical partition processing is done in Fig. 13
at the scheduler layer 66 in the driver 64. This
scheduler layer has two I/O request policies, initial
request and post request. The initial request policy
receives and processes requests from the strategy
layer 65 and is illustrated in Fig. 18. The post request
policy interfaces with the physical layer 67 and is
illustrated in Fig. 19. A description of the functions within
these policies follows.

Initial Request Policy

REGULAR

Returns EIO for a request that avoids the only copy,
or whose target PP is being reduced, or whose target
PV is missing. If the request is a special VGSA write
(REQ_VGSA set in b_options), a special pbuf (sa_pbuf)
embedded in the volgrp structure is used instead of
allocating a pbuf from the free pool.

SEQUENTIAL 

Returns EIO for request that avoids all active 
copies(mirrors). Read requests select the partition to 
read from in primary, secondary, tertiary order. Write 
requests select the first active copy and initiate the 
write. The remaining copies are written in sequential 
order, primary, secondary, tertiary, after the preced- 
ing partition has been written. Sequential only initiates 
the first physical operation. Any subsequent oper- 
ations, due to read errors, or multiple writes are han- 
dled by the post request policy. Sequential does not 
write to partitions that are stale or are on PVs with a 
status of missing. 

PARALLEL 

Returns EIO for request that avoids all active 
copies(mirrors). Read requests read from the active 
partition that requires the least amount of PV head 
movement based on the last queued request to the 
PV. Write requests generate writes to all active parti- 
tions simultaneously, i.e. in parallel. PARALLEL does 
not write to partitions that are stale or are on PVs with 
a status of missing. 
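
The read selection rule of the PARALLEL policy can be sketched as below. The text says only that the copy needing the least head movement, judged from the last request queued to its PV, is chosen; measuring that movement as the absolute difference between two block numbers, and every name used here, are assumptions made for illustration.

```python
import errno


def pick_parallel_read(candidates):
    """Pick the copy to read from under the parallel policy.

    candidates: list of (copy_index, target_block, last_queued_block)
    tuples, one per active copy that is not being avoided.
    Head movement is approximated as |target - last queued| (assumption).
    """
    if not candidates:
        # The request avoids all active copies (mirrors).
        raise OSError(errno.EIO, "request avoids all active copies")
    best = min(candidates, key=lambda c: abs(c[1] - c[2]))
    return best[0]
```

When two copies tie, `min` keeps the first one listed, so listing copies in primary, secondary, tertiary order gives the primary precedence.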

AVOID 

Builds an avoid mask for the mirrored policies. 
This mask informs the scheduling policy which parti- 
tions to avoid or not use. Following is a description of 
when mirrors are to be avoided. 

GENERAL - Applies to both read & write 
requests. 

i) Non-existent partitions or holes in a logical par- 
tition. 

ii) Explicitly avoided by request. There are
bits (AVOID_C1,2,3) in the b_option field of the
request that explicitly avoid a copy (used for read
requests only).

READ - Applies to read requests only. 

i) Partitions located on PVs with a status of mis- 
sing. 

ii) Partitions that are in the process of being 
reduced or removed. 

iii) Partitions that have a status of stale.

WRITE - Applies to write requests only.

i) Partitions that are in the process of being
reduced or removed.

ii) Partitions that have a status of stale and that
status is not in transition from active to stale.

iii) If there is a resync operation in progress in the
partition and the write request is behind the current
position of the resync operation, then allow the
write even if the partition status is stale.

If the request is a resync operation or a mirror write
consistency recovery operation, the sync-mask is
also set. The sync-mask informs the resyncpp post
request policy which partitions are currently stale and
therefore which ones to attempt to write to once good
data is available.
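
The avoid rules listed above can be summarised in a short sketch. It is not the driver code: the Copy record, the field names and the bit assignment (bit i for copy i) are hypothetical, and "behind the resync position" is taken to mean an LTG number lower than the resync position. The sync-error exception is taken from the resynchronization discussion later in this description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Copy:
    """Driver-side state of one mirror copy (PP); names are hypothetical."""
    stale: bool = False
    changing: bool = False    # status in transition from active to stale
    reducing: bool = False    # being reduced or removed
    pv_missing: bool = False  # PV has a status of missing
    sync_error: bool = False  # write error seen during the current resync


def build_avoid_mask(copies: List[Optional[Copy]], is_write: bool,
                     explicit_avoid: int = 0,
                     request_ltg: Optional[int] = None,
                     sync_track: Optional[int] = None) -> int:
    """Return a bit mask; bit i set means copy i must not be used.

    A None entry in copies stands for a non-existent partition (hole).
    """
    # Explicit avoid bits are honoured for read requests only.
    mask = explicit_avoid if not is_write else 0
    for i, copy in enumerate(copies):
        bit = 1 << i
        if copy is None or copy.reducing:
            mask |= bit                      # hole, or being reduced
        elif not is_write:
            if copy.pv_missing or copy.stale:
                mask |= bit                  # READ rules i) and iii)
        elif copy.stale and not copy.changing:
            # WRITE rule iii): a write behind the resync position is
            # allowed even though the copy is stale.
            behind = (sync_track is not None and request_ltg is not None
                      and request_ltg < sync_track
                      and not copy.sync_error)
            if not behind:
                mask |= bit                  # WRITE rule ii)
    return mask
```

Note that a write is not steered away from a copy on a missing PV here; that case is caught later by the missing PV method so that the copy can be marked stale.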

Post Request Policy

FINISHED

Generally the exit point from the scheduler layer
back to the strategy layer. Responsible for moving
status from the given pbuf to the lbuf. If the pbuf is not
related to a VGSA write, REQ_VGSA set in b_options,
the pbuf is put back on the free list.

MIRREAD 

Used by both sequential and parallel policies
when the request is a read. It has the responsibility of
checking the status of the physical operation. If an
error is detected it selects another active mirror. It
selects the first available mirror in primary, secondary,
tertiary order. When a successful read is complete
and there were read errors on other mirrors, MIRREAD
will initiate a fixup operation via FIXUP.

SEQWRITE 

Used by the sequential policy on write requests.
It has the responsibility of checking the status of each
write and starting the write request to the next mirror.
Writes are done in primary, secondary, tertiary order.
When all active mirrors have been written, any mirrors
that failed are marked stale by the WHEEL (to be
described hereafter).

PARWRITE 

Used by the parallel policy on write requests. The
initial parallel policy issued physical requests to all
mirrors in parallel. PARWRITE checks the status of
each of the completed physical requests. PARWRITE
remembers only if a write error occurred or not.
PARWRITE puts the pbufs back on the free list as they
complete and coalesces the status into an outstanding
sibling. Therefore the last physical request to complete
holds the pass/fail status of all the siblings
including itself. If any write errors are detected the
affected mirrors will be marked stale (by the WHEEL)
only after all physical requests for a given logical
request are complete.

FIXUP 

Used to fix a broken mirror, one that had a read
error, after another mirror was read successfully.

RESYNCPP 

Used to resynchronize a logical partition (LP). The
initial policy, sequential or parallel, selects an active
mirror, one not stale or on a missing PV, to read from
first. RESYNCPP checks the status of the read. If an
error is detected RESYNCPP will select another mirror
if one is available. Once a successful read has
been done RESYNCPP will write that data to any stale
physical partition in the LP. RESYNCPP does not
attempt to fix broken mirrors, i.e. ones that failed the
initial read. RESYNCPP is also used to do MIRROR
WRITE CONSISTENCY RECOVERY (MWCR) operations.
During MWCR operations RESYNCPP will
mark partitions stale if a write fails in the partition.

SEQNEXT 

Used to select the next active mirror considering 
the ones already used, stale partitions, and missing 
PVs. 

Referring now to Fig. 14, each PP 72 that is 
defined in the volume group has state information in 
the partition structure. Each PP must be in one of two 
permanent states. It can be active, available for all I/O, 
or stale, not available for all I/O. In addition, there are 
two intermediate states, called reducing and chang- 
ing. The permanent state of each PP in the volume 
group 84 is also maintained in the Status Area(VGSA) 
82, as shown in Fig. 6f. A copy of the VGSA 82 resides 
on each physical volume 80 of the volume group 84. 

This allows the state of each partition to be
retained across system crashes and when the VG is
not online. The driver 64 has the responsibility of
maintaining and updating the VGSA. Stale PP processing
is not complete until all VGSAs have been
updated. The VGSAs are updated by a mechanism
hereinafter called the WHEEL, which is described
hereafter. References will be made to requests
going to or returning from the VGSA
WHEEL 90 shown in Fig. 16. The VGSA WHEEL's
request object is the physical request (PR) or pbuf
structure. It accepts pointers to PRs and returns them
via the pb_sched field of the same structure when all
VGSAs have been updated.

Following is a description of each PP state. 



ACTIVE 

The partition is available for all I/O. Read
requests for the LP can read from this PP. Writes to
the LP must write to this PP.

STALE 

The partition cannot be used for normal I/O. The
data in the partition is inconsistent with data from its
peers. It must be resynchronized to be used for normal
I/O. It can be reduced or removed from the LP.

REDUCING

The partition is being reduced or removed from
the LP by the configuration routines. Any reads or
writes that are currently active can be completed
because the configuration routines must drain the LV
after putting the partition in this state. The initial
request policies must avoid this PP if a read request
is received when the PP is in this state. The configuration
routines will also turn on the stale flag under
certain conditions to control write requests that may
be received. These configuration routines are further
described hereafter.

CHANGING

The partition has changed states from active to
stale and the initial request that caused that change
has not been returned from the VGSA WHEEL. A read
request to a LP that has a PP changing must avoid the
PP. A write request to the LP cannot be returned until
the WHEEL returns the initial request that caused the
state change. This is done by actually building the PR
and then handing it off to the VGSA WHEEL. The
WHEEL handles duplicate operations to the same
partition and will return them when the initial request
is returned.

There are some general rules that apply to logical
requests (LR) and PPs when they encounter stale PP
processing. First, once a partition goes stale it cannot
accidentally become active again due to a system
crash or error. There is one exception to this: if the VG
was forced on with the force quorum flag, the selected
VGSA may not have contained the latest PP state
information. If a user forces the VG, they take their
chances. Secondly, a LR will not be returned until all
stale PP processing is complete. This means that all
VGSAs have been updated.

It is an illegal state for all copies of a logical partition
(LP) to be marked stale. There must be at least
one active partition. That one active partition can be
on a PV that is missing. All writes to that LP will fail
until the PV is brought back online. Of course the
entire LP can be reduced (removed) out of the LV.

If all copies of a LP have write failures then all but
one copy will be marked stale before the LR is returned
with an error. Since there is no guarantee that all
the writes failed at the same relative offset in the PRs,
the assumption must be made that possible inconsistencies
exist between the copies. To prevent two different
reads of the same logical space from returning
different results (i.e. they used different copies), the
number of active partitions must be reduced to one for
the LP. The only exception to this is when no PR has
been issued before the detection that all copies will
fail, which might occur if the logical volume (LV) is
using the parallel write policy, which is described
hereafter.

There are three ways a PP can become stale. The
first is by the system management mechanism that
extends a LP horizontally when valid data already
exists in at least one PP for this LP or the PP is being
reduced (removed). This will be referred to as the config
method.

A partition can become stale when a write to its
respective LP is issued and the PV where the PP is
located has a status of missing. This type of staleness
is detected before the physical request is issued to the
PV. This will be referred to as the missing PV method.

Finally, a PP can become stale when a write to it 
is returned with an error. This will be referred to as the 
write error method. 

A more detailed discussion of action and timing 
for each method follows. 

CONFIGURATION METHOD 

The config method is really outside the normal 
flow of read and write requests that flow through the 
scheduler layer. It is important that the driver and the 
configuration routines stay in sync with each other 
when the state of the PP is changing. A set of proced- 
ures is defined later that covers how this is done. 

MISSING PV METHOD 

Detecting that a PR is targeted for a PV that has
a status of missing must be done before the request
is issued, as shown in Fig. 19. All mirrored write policies
96, 100, 122 and 124 must check the target PV's
status before issuing the PR to the lower levels 106.
If that status is detected the PR will be sent to the
VGSA WHEEL 90. The PR must have the proper post
policy encoded, the B_DONE flag reset in the b_flags
field and a type field that requests this PP be marked
stale. The post request policy, Fig. 19, will decide
what action is next for this LR when the VGSA WHEEL
returns the PR. The one exception to this is in the initial
request parallel policy. If it detects that all active
partitions are on PVs with a status of missing, it can
return the LR with an error, EIO, and not mark any partitions
stale. It can do this because the data is still consistent
across all copies for this request.



WRITE ERROR METHOD 

When a request is returned from the physical
layer 108, to the post request policy shown in Fig. 19,
with an error, the policy must decide if the partition is
to be marked stale. There are several factors involved
when deciding to mark a mirror stale. A few are mentioned
below with references to Fig. 19.

If the post policy is sequential 98 and this is the
last PR for this LR and all other previous PRs failed
(and their partitions marked stale), then this partition
cannot be marked stale. If it were marked stale then
all copies of this LP would be marked stale and that
is an illegal state.

Resync operations 102 do not mark mirrors stale,
but if the write portion of a resync operation fails then
the failing partition cannot be put into an active state.

Mirror write consistency recovery operations will
mark a mirror stale if the write to the mirror fails.

In any case, if the partition is to be marked stale
the PR must be set up to be sent to the VGSA
WHEEL. This entails the proper post policy be
encoded, the B_DONE flag reset (in the b_flags field)
and a type field be set that requests this PP be marked
stale. When the PR is returned by the VGSA WHEEL
the receiving post policy will decide what action is next
for this PR and the parent LR.

Any post request policy that receives a PR from
both the physical layer 108 and the VGSA WHEEL 90
must query the B_DONE flag in the b_flags field to
determine the origin of the PR. Since the post request
policy handles PRs from both the physical layer and
the VGSA WHEEL, it makes all the decisions concerning
the scheduling of actions for the request and
when the LR request is complete.

Now that the states of a PP have been defined,
the procedures for handling a request in relationship
to those states must be defined. Also defined are the
procedures the configuration routines and the driver
must follow for changing the states in response to system
management requests.

Driver only procedures

1) State is active

Read requests may read from the partition.

Write requests in the initial request policies must
write to the partition.

Write requests in the post request policies of Fig.
19 that are returned with an error must:

i) Turn on the changing flag and stale flag. The
partition has just changed states.

ii) Remember that the PR failed.

iii) Hand the PR off to the VGSA WHEEL 90.

iv) When the PR is returned from the WHEEL 90
the changing flag must be turned off. The partition
has just changed states again.

2) State is stale

Read requests in the initial request policies must
avoid the partition.

Write requests in the initial request policies will
avoid the partition.

Write requests in the post request policies that
are returned with an error must:

i) Remember that the PR failed. Since the changing
state flag is not on, that is all the action necessary
at this point. There is no need to send this
request to the WHEEL 90 because the partition is
already marked stale. This condition can happen
because the partition was active when the
request was handed off to the disk device driver,
but by the time it returned another request had
already failed, marked the partition stale via the
VGSA WHEEL and returned from the WHEEL.
Therefore, there is nothing for this request to do
concerning stale PP processing.

3) State changing from active to stale

Read requests in the initial request policies of Fig.
18 must avoid the partition altogether.

Write requests in the initial request policies of Fig.
18 must be issued to the disk drivers as if the partition
was not changing states. The post request policies of
Fig. 19 will handle the request when it is returned.

Write requests in the post request policies of Fig.
19 that are returned with an error must:

i) Remember that the PR failed.

ii) Hand the PR off to the VGSA WHEEL.

iii) When the PR is returned from the WHEEL the
changing flag should have been turned off. This
request can now proceed.
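
The three driver only procedures for a write that is returned with an error can be condensed into one sketch. The PP and PR records, the use of a plain list as the queue leading to the VGSA WHEEL, and the function names are hypothetical; only the flag handling follows the procedures above.

```python
from dataclasses import dataclass


@dataclass
class PP:
    """Physical partition state flags (hypothetical record)."""
    stale: bool = False
    changing: bool = False


@dataclass
class PR:
    """Physical request aimed at one PP (hypothetical record)."""
    pp: PP
    failed: bool = False


def on_write_error(pr, wheel_queue):
    """Handle a write PR returned with an error.

    Returns True if the PR was handed off to the VGSA WHEEL.
    """
    pr.failed = True                 # always remember that the PR failed
    pp = pr.pp
    if pp.stale and not pp.changing:
        return False                 # 2) already stale in all VGSAs
    if not pp.stale:
        pp.stale = True              # 1) active: partition changes state
        pp.changing = True
    wheel_queue.append(pr)           # 1) and 3): hand off to the WHEEL
    return True


def on_wheel_return(pr):
    """Called when the WHEEL returns the PR that started the change."""
    pr.pp.changing = False           # partition is now permanently stale
```

A second failing write to the same partition that arrives while the changing flag is still on is also handed to the WHEEL, which is where the piggybacking described later applies.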

NOTE: Post request policies that find that a read 
request has failed just select another active partition 
and retry the read. Read errors do not usually cause 
data inconsistencies, but, as usual, there is one 
exception. There is a post request policy called fixup 
100. This policy attempts to fix broken mirrors, i.e. 
ones that have had a read error. It fixes these broken 
mirrors by rewriting them once a successful read has 
been completed from another mirror. If the rewrite of 
a broken mirror fails this partition must be marked 
stale since it is now possible for the data to be incon- 
sistent between the mirrors. 

Configuration - Driver procedures 

1) Partition created stale 

When a LP that already contains valid data is horizontally
extended, the resulting driver structures and
the VGSA must indicate that the partition is stale. This
means the VGSAs must all be updated before the
configuration operation is considered complete. A
more detailed procedure can be found in the VGSA
discussion to follow.

i) Configuration routines set up the permanent
state of each PP being allocated via the VGSA
WHEEL. See the VGSA discussion for a more
detailed breakdown of what must be done in this
step.

ii) Configuration routines set up driver structures
and link them into the existing driver information.
If the partition is active it can be used
immediately. If stale, it must be resynchronized
before it will be used. This step should be done
disabled to INTIODONE to inhibit any new
requests from being scheduled while the configuration
routines are moving driver structures
around.

2) Reducing an active or stale partition 

The procedure below will work for reducing either
an active partition or a stale partition. It is very high
level. A more detailed procedure can be found in the
VGSA discussion.

i) The configuration routines set the state flag for
each PP being reduced (removed) and initiate
the update of the VGSAs. This is done via the
configuration/VGSA WHEEL interface.
NOTE: This is not necessary if all PPs being
reduced are already stale.

ii) With the state of each partition now stale and
recorded permanently the LV must be drained.
Draining the LV means waiting for all requests
currently in the LV work queue to complete.
NOTE: This is not necessary if all PPs being
reduced are already stale.

iii) Disabled to INTIODONE, the configuration
routines may now remove the driver structures
associated with the PPs being removed.

3) Stale PP resynchronization

Up to this point the discussion has centered on
marking partitions stale. There is another side to this
issue. How is the data made consistent between
copies so all are available and active again? This
operation is called a resync operation. The
resynchronization of an entire LP is accomplished by
an application process issuing, via the character
device node of the LV, multiple resync requests starting
at the beginning of the LP and proceeding sequentially
to its end. This must be done by issuing readx
system calls with the ext parameter equal to
RESYNC_OP as defined in sys/lvdd.h. Each request
must start on a logical track group (LTG) boundary and
have a length of one LTG. A LTG is 128K bytes long.
Therefore, to resynchronize a 1 MB LP a series of 8
of these requests would have to be made. After the
8th resync operation, if there were no write errors in the
partition by any operation, resync or normal write, the
VGSA is updated to indicate that the newly
synchronized partitions are now fresh and active.
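
The request arithmetic can be checked with a small sketch. It only computes the byte offset and length of each resync request; it does not issue readx calls, and the function name and the byte-offset convention are assumptions.

```python
LTG_BYTES = 128 * 1024  # a LTG is 128K bytes long (from the text)


def resync_requests(lp_start, lp_bytes):
    """Return the (offset, length) pairs needed to resync one LP.

    Each request starts on a LTG boundary and is exactly one LTG long.
    Offsets are byte offsets into the LV (hypothetical convention).
    """
    if lp_start % LTG_BYTES or lp_bytes % LTG_BYTES:
        raise ValueError("LP must start and end on a LTG boundary")
    return [(offset, LTG_BYTES)
            for offset in range(lp_start, lp_start + lp_bytes, LTG_BYTES)]


# A 1 MB LP needs 1048576 / 131072 = 8 requests.
requests = resync_requests(0, 1024 * 1024)
```

The requests would be issued in list order, since the resynchronization must proceed sequentially from the beginning of the LP to its end.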

Each resync request is made up of several physical
operations. The first operation is a read and it is
initiated by the initial request policy of Fig. 18. The post
request policy of RESYNCPP in Fig. 19 verifies that
the read was done without errors. If an error is returned
another active mirror is selected to read from. If
there are no other active mirrors the resync request is
returned with an error, and the resynchronization of
the LP is aborted. It must be restarted from the
beginning.

The next operation of the resync request is to 
write the data, just read, to any stale partitions using 
a sequential write type policy. If an error is returned 
the partition status is updated to indicate the partition 
has had a sync-error. 

If all stale partitions have a status of sync-error 
the resynchronization of the LP is aborted. If all the 
LTGs of one PP are successfully resynchronized then 
that PP will change status from stale to active. 

Following is a list of actions and decisions sur- 
rounding the resynchronization of a LP. 

i) The synchronization of a LP is initiated by issuing
a resync request at the first LTG in the partition
and proceeding sequentially through the last
LTG in the partition. The LP must have stale mirrors
or this initial request will be returned with an
error. The sync-error status of each PP will be
cleared and the internal resync position (LTG
number) maintained by the driver will be set to 0.
This internal resync position is referred to as the
sync track. A sync track value that is not 0xFFFF
indicates that this LP is being resynchronized and
what track is currently being done or was last
done. There is a flag in the PP state field which
qualifies the sync track value; it is called the
Resync-In-Progress (RIP) flag. When the RIP flag is
on, the sync track value represents the LTG currently
being operated on. If the RIP flag is reset
the sync track value represents the next LTG to
be operated on. This is how the driver remembers
the position of the resynchronization process and
allows normal read/write operations to proceed
concurrently with the resynchronization of the LP.

ii) A LTG will be resynced if the partition is stale
and the sync-error flag is reset.

iii) Any write error in the LTG being resynced will
cause that partition to have a status of sync-error.
Writes in a LP that occur behind the sync track
write to all PPs even though they may be stale.
The exception to this is if the partition has the
sync-error flag on. Consequently, any write errors
cause the copies to be inconsistent again. Therefore,
these write errors must turn on the sync-error
flag to let the resynchronization process know
that an error has occurred behind it in this partition.

iv) An individual resync request is considered
successful if at least one of the LTGs currently
being resynced completes with no errors.

v) If all stale partitions in a LP develop a status of
sync-error the resynchronization of the LP is
aborted. It must be restarted from the beginning
of the LP.

Recovery of the VGSA at VG varyon time is
addressed by the discussion of the VGSA and the
VGSA WHEEL.

VOLUME GROUP STATUS AREA (VGSA) 

Each physical partition (PP) in the volume
group (VG) has two permanent states, active or stale.
These states are maintained in the Status
Area (VGSA). There is a copy of the VGSA on each PV
in the VG, as shown in Fig. 15. Some PVs may have
more than one copy. The VGSA copies on all PVs,
along with a memory version, are maintained by
software in the driver 64 that runs in the scheduler
layer 66 in Fig. 13. This software accepts requests
from the scheduling policies of Fig. 18 & Fig. 19 or the
configuration routines to mark partitions stale or
active. This software is called the WHEEL because of
the way it controls and updates the active VGSAs in
the VG. Refer to Fig. 16 for the following discussion
on the WHEEL.

The basic object of the WHEEL is to ensure that
all VGSAs in the VG are updated with the new state
information for any given WHEEL request. It would be
easy and relatively fast to issue write requests to all
VGSAs at the same time. But that would also be very
dangerous, since with that method it is possible to
have a catastrophic error that would cause the loss of
all VGSAs in the VG. That brings up the first of the
general rules for the WHEEL.

General Rule 1)

Only one VGSA write can be in flight at a time.

Refer to Fig. 16. When a request is received by
the WHEEL the memory version of the VGSA is
updated as per the request. Then VGSA 1 is written.
When it is complete a write to VGSA 2 is issued. This
continues until VGSA 8 has been written. The WHEEL
is now back at VGSA 1 where it started. Now the
request is returned back to the normal flow of the
driver, as shown in Fig. 19, so it can continue to its
next step. The second general rule is:

General Rule 2)

A request cannot be returned until all VGSAs in
the VG have been updated with that request's operation.

It should be obvious now why this is called the
WHEEL. It should be equally obvious that any request
on the WHEEL may stay there a while. In the above
example the request had to wait for 8 complete disk
operations. If a VG contained 32 PVs a request would
have to wait for 32 disk operations. Now, assume
while the request was waiting for the write to VGSA 1
to complete another request came in and wanted to
update the VGSA. If the second request had to wait
for the first request to get off of the WHEEL it would
have to wait for 16 disk operations before it could continue.
Eight disk operations for the first request and 8
for itself. This wait time could become quite large if the
VG contained a large number of PVs. Luckily, the
WHEEL has several conditions to reduce this wait.

The WHEEL manages the requests it receives so
that no one request must wait, stay on the wheel if you
will, longer than the time it takes to write the total number
of VGSAs in the VG plus one. This is accomplished
by allowing requests to get on the WHEEL
between VGSA writes. A request then stays on the
WHEEL until the WHEEL rolls back around to the
position where the request got on the WHEEL. Once
the WHEEL has been started it is said to be rolling.
Once rolling it will continue to write the next VGSA
until all VGSAs in the VG contain the same information
regardless of how many requests get on and
off the WHEEL or how many revolutions it takes. This
is sometimes called free wheeling.

In reference to the above two request scenario,
the following would happen. Request #1 comes in
from the initial or post request policies, as shown in
Figs. 18 and 19, and causes the WHEEL to start rolling
by writing to VGSA 1. Request #2 comes in and
waits for the write to VGSA 1 to complete. When that
write is complete, request #2 updates the memory
version of the VGSA. When that is done VGSA 2 is
written. When that completes VGSA 3 is written. This
continues until VGSA 1 is the next one to be written.
At this point request #1 is returned to the normal flow
of the driver 64 since all VGSAs reflect the status
change requested by request #1. Now the write to
VGSA 1 is started since that VGSA does not match
the image of VGSA 8. When that write is complete
request #2 is returned to the normal flow of the driver.
This is done because the WHEEL has rotated back to
the position where request #2 jumped on. Now,
because VGSA 2, the next to be written, and VGSA
1, the last written, are identical the WHEEL stops. The
next request will start the WHEEL at VGSA 2.
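
The two request scenario can be replayed with a small simulation. This is a sketch under stated assumptions, not the driver's implementation: each VGSA is reduced to a version number, one call to step() stands for one complete disk write, and the class and method names are hypothetical. VGSA 1 of the text is index 0 here.

```python
class Wheel:
    """Toy model of the VGSA WHEEL; all names are hypothetical."""

    def __init__(self, n_vgsas):
        self.n = n_vgsas
        self.pos = 0               # index of the next VGSA to write
        self.memory = 0            # version of the in-memory VGSA image
        self.disk = [0] * n_vgsas  # version held by each on-disk VGSA
        self.riding = {}           # request name -> writes still needed
        self.pending = []          # arrived while a write was in flight
        self.writes = 0            # total VGSA writes issued so far

    def submit(self, name):
        self.pending.append(name)

    def rolling(self):
        # The WHEEL stops when the next VGSA already matches memory.
        return bool(self.pending) or self.disk[self.pos] != self.memory

    def step(self):
        """Let waiting requests get on, then do exactly one VGSA write."""
        for name in self.pending:
            self.memory += 1           # request updates the memory image
            self.riding[name] = self.n  # needs one full revolution
        self.pending.clear()
        self.disk[self.pos] = self.memory  # General Rule 1: one write
        self.pos = (self.pos + 1) % self.n
        self.writes += 1
        returned = []
        for name in list(self.riding):
            self.riding[name] -= 1
            if self.riding[name] == 0:     # General Rule 2 satisfied
                returned.append(name)
                del self.riding[name]
        return returned


def two_request_scenario(n_vgsas=8):
    wheel = Wheel(n_vgsas)
    returned_at = {}
    wheel.submit("req1")
    for name in wheel.step():      # write to VGSA 1 starts the WHEEL
        returned_at[name] = wheel.writes
    wheel.submit("req2")           # arrives during that first write
    while wheel.rolling():
        for name in wheel.step():
            returned_at[name] = wheel.writes
    return returned_at, wheel.pos


result = two_request_scenario()
# result == ({"req1": 8, "req2": 9}, 1): request #1 returns after 8
# writes, request #2 after 9, and the WHEEL stops with VGSA 2 next.
```

In this model request #2 waits for nine writes in total, the number of VGSAs plus one, which is the bound stated above.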

There is one other major WHEEL condition. It is
called piggybacking. It is very likely, given the nature
of disk drives, that several disk requests will fail in the
same partition. This will result in all of them wanting
to change the state of that partition. Depending on the
length of time between these like failures it would be
possible for these like state change requests to get on
the WHEEL at various positions. That is where
piggybacking comes in. Before a request is put on the
WHEEL a check is made to see if a like request is
already on the WHEEL. If one is found the new
request is piggybacked to the one already there.
When it comes time for the first request to get off of
the WHEEL any piggybacked requests get off also.
This allows the like state change requests to get off
sooner and keeps the WHEEL from making any
unnecessary writes.
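
The piggybacking check might be sketched as follows.
The names Request, board and release_at are hypothetical,
and treating two requests as "like" when they target the
same partition with the same new state is an assumption
made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Request:
    name: str
    partition: int          # physical partition whose state is changing
    new_state: str          # for example "stale"
    position: int = -1      # wheel position where the request boarded
    riders: list = field(default_factory=list)   # piggybacked like requests


def board(active, req, wheel_pos):
    """Put req on the wheel, or piggyback it on a like request already there.

    Returns True when the request is new to the wheel, so the memory VGSA
    needs updating; False when it was piggybacked.
    """
    for other in active:
        if (other.partition, other.new_state) == (req.partition, req.new_state):
            other.riders.append(req)
            return False
    req.position = wheel_pos
    active.append(req)
    return True


def release_at(active, wheel_pos):
    """Take off every request that boarded at wheel_pos, along with its riders."""
    done = []
    for req in [r for r in active if r.position == wheel_pos]:
        active.remove(req)
        done.append(req.name)
        done.extend(rider.name for rider in req.riders)
    return done
```

A piggybacked request leaves at the position of the
request it is attached to, not at the position where it
arrived, and it causes no additional VGSA update.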

This is not contradictory to the second general
rule because it states that all VGSAs must have been
updated with a request's information before it is
returned. Piggybacking meets that requirement because
all the piggybacked requests are doing the same
thing. Therefore, they can all get off the WHEEL at the
same position regardless of where they jumped on.
However, the initial request policies and the post
request policies must be aware of any PP that is
changing states. Otherwise, they may return a
request early believing a partition to be already
marked stale when in fact there is a previous request on
the WHEEL doing just that. The second request must
be piggybacked to the one currently on the WHEEL.
This additional intermediate state may be quite long,
relatively. A PP is considered in a changing state from
the time the decision is made to change states until
the time that request gets off of the WHEEL. During
that time any I/O requests that are targeted for a
partition that is changing states must follow the rules
stated in the stale PP processing discussion.

We have seen how the WHEEL handles indivi- 
dual PRs that are changing the state of a single par- 
tition. But, there is another major aspect to the 
partition state methodology. That is the configuration 
routines. These routines want to set the state of many 
partitions as the LV is extended or reduced while it is 
open and in use. To accomplish this there must be a 
mechanism available and procedures defined that 
allow the configuration routines to: 

i) pause the WHEEL if it is rolling; 
otherwise keep it from starting 

ii) set the state of multiple partitions 

iii) restart the WHEEL and wait for all the VGSAs 
to be updated 

This all must be done in a way that maintains LV integ- 
rity during the life of the operation, even across sys- 
tem crashes. 

Refer now to Fig. 20 for the following WHEEL pro- 
cedures. 

Volume Group Status Area 
START 

Called to change the status of a partition. This can
be caused by two different mechanisms. First, a write
failure in a mirror logical partition, LP. Second, an
existing LP is extended, made wider, and valid data
exists in the original. In this case, the newly created
partitions are stale in relationship to the original.
START always puts any new request on the hold list
SA_HLD_LST. Then, if the wheel is not rolling, it will
start it.

SA CONT 

This block has several responsibilities. First it
checks to see if a configuration operation is pending.
Since the VGSA wheel is free wheeling once it is
started, a configuration operation must wait until the
wheel gets to a stopping point before any changes
can be made to the in-memory version of the VGSA.
There is a stopping point for modifications between
writes to the physical volumes. The wheel will not be
started again until the update to the memory version
of the VGSA by the configuration process is complete.
Then the configuration process will restart the wheel.
The second major function is to remove any requests
from the holding list SA_HLD_LST and scan the active
list SA_ACT_LST for like operations. If a like operation
is found, then associate this request with the previous
one. This allows the new request to get off of the
wheel at the same point as the request that is already
on the active list. If no like operation is found on the
active list, then update the memory version of the
VGSA as per that request. If a loss of quorum (to be
described hereafter) is detected, then flush all the
requests that are on the wheel.

WHLADV 

Advance the wheel to the next VGSA 

REQ STOP 

Starting at the head of the active list
SA_ACT_LST, check each request that has not just
been put on the list. If this is the wheel position where
the request was put on the wheel then remove it from
the active list and return it to its normal path. After the
active list has been scanned for completed requests
a check is made to see if the memory version of the
VGSA has been written to the target VGSA on the PV.
If the memory VGSA sequence number does not
match the PV VGSA sequence number a write to the
VGSA is initiated.

NOTE: Once the wheel is started it will continue to
write VGSAs until the memory VGSA sequence
number matches the sequence number of the PV VGSA
that will be written next.
Also, if the VGSA to be written next is on a PV that is
missing the wheel will be advanced to the next VGSA
and the active list is scanned again. When an active
VGSA is finally found the wheel position of this VGSA
is put in any new requests that were put on the active
list by SA CONT. This indicates where they are to get
off when the wheel comes back around to this
position. Therefore, no request gets put on the wheel at an
inactive VGSA. But requests can get off at a position
that has gone inactive while the request was on the
wheel.

WRITE SA

Builds request buffer and calls regular (Fig. 18) to
write the VGSA to a PV.

SA IODONE

The return point for the request generated by
WRITE SA. If the write failed the PV is declared as
missing and a quorum check is made. If a quorum is
lost due to a write failure SA IODONE only sets a flag.
The actual work of stopping the wheel and flushing the
active list is done in SA CONT.

LOST QUORUM

The volume group (VG) has lost a quorum of
VGSAs. Sets a flag to shut down I/O through the VG.
Returns all requests on the wheel with errors.

Various modifications may be made in the details
of the preferred embodiment described above.

Following are the high level procedures for the
various configuration management functions used to
maintain a VG when they interact with the WHEEL.

EXTENDING A LV

When extending any LV, even when there are no
mirrors in any LP, the permanent state must be
initialized in the VGSA. There is a general assumption
when extending a LV that any partition being allocated
is not currently in use and that the VGDAs have not
been updated to indicate this partition is now
allocated. It is further assumed that the write of the
VGDAs is near the end of the overall operation so that
the LV maintains integrity if disaster recovery is
needed. There are some conditions that can be
implemented for this procedure and they will be
mentioned.

i) Get control of the WHEEL. That means if it is
rolling, stop it. If it is not rolling, inhibit it from starting.

ii) Modify the memory version of the VGSA.

iii) Restart or start the WHEEL. Wait for it to
complete one revolution.

NOTE: If the WHEEL was not rolling and there were
no state changes in the memory version of the VGSA
then there is no need to restart the WHEEL and wait
for it to complete a revolution.

NOTE: If the WHEEL was rolling and there were no
state changes in the memory version of the VGSA
then restart the WHEEL but there is no need to wait
for it to complete a revolution.

iv) Disable to INTIODONE. Link the new partition
structures into the driver hierarchy. Re-enable
interrupt level. Read/write operations can now
proceed on the PP if it is active.

NOTE: It is assumed the new partition structures
contain the same permanent state as was just initialized
in the VGSA.

REDUCING A LV 

Reducing an active LV must be done with care. In 
addition to the integrity issues that must be addressed 
there is the likelihood that there is I/O currently active 
in the PPs that are being removed. 

i) Get control of the WHEEL. That means if it is
rolling, stop it. If it is not rolling, inhibit it from starting.

ii) Disable to INTIODONE. For all LPs being
reduced a check must be made to ensure that at
least one active PP is left in the LP before the
reduction of any LP can proceed. The only
exception to this is if all PPs are being reduced from the
LP, thereby eliminating it. If any LP will be left with
no active PP, the entire reduce LV operation
should fail. For each PP being reduced, turn on
the reducing flag in its respective partition
structure. Also, and this is a big also since it has to do
with integrity in the face of disaster recovery, IF
the PP is a member of a LP with multiple copies
AND not all of the PPs of this LP are being
removed, AND the PP being removed is not stale,
THEN the changing and stale flags must be
turned on also. IF the stale flag is turned on THEN
the memory version of the VGSA must be
updated also. THEN, re-enable the interrupt
level.

This is somewhat complicated but must be done
this way to ensure that a PP does not come back
active after a system crash if the VGDAs don't get
updated before the crash and a write may have
occurred in the LP before the crash. If all PPs of
a LP are being removed then the reduce flag will
keep any new requests out of the LP. Then if the
system crashes the data will still be consistent
between all copies upon recovery.

iii) IF the memory version of the VGSA was mod- 
ified THEN start/restart the WHEEL AND wait for 
it to complete one revolution. IF the memory ver- 
sion of the VGSA was not modified THEN release 
the inhibit on the WHEEL and restart it if it was rol- 
ling when we started. 

iv) Drain the LV. This means wait for all requests 
currently in the LV work queue to complete. 

v) Disable to INTIODONE. Now remove the par- 
tition structures from the driver hierarchy for the 
PPs being removed. Re-enable interrupt level. 

vi) The VGDAs can be written. 

NOTE: If the VGDAs cannot be written and the reduce
operation fails, the PPs will remain in their current
state of reducing and/or removed from the driver
hierarchy and, therefore, will not be available for I/O.
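
The flag decisions of step ii) can be written out as a
small function. This is a sketch, not the driver code: the
name plan_reduce and its data layout are hypothetical,
and it assumes that an "active" PP is simply one that is
not stale.

```python
def plan_reduce(lp_stale, removing):
    """Decide which flags to set when PPs are removed from one LP.

    lp_stale maps each PP of the LP to True if it is stale (assumption:
    a PP that is not stale counts as active). removing is the set of PPs
    being removed. Returns (flags per removed PP, memory VGSA modified).
    Raises ValueError when the LP would be left with no active PP.
    """
    remaining = set(lp_stale) - set(removing)
    removing_all = not remaining
    if not removing_all and all(lp_stale[pp] for pp in remaining):
        raise ValueError("reduce would leave the LP with no active PP")
    flags = {}
    vgsa_modified = False
    for pp in removing:
        pp_flags = {"reducing"}
        # Multiple copies AND not all PPs removed AND this PP is not stale.
        if len(lp_stale) > 1 and not removing_all and not lp_stale[pp]:
            pp_flags |= {"changing", "stale"}
            vgsa_modified = True     # stale flag set, so the memory VGSA changes
        flags[pp] = pp_flags
    return flags, vgsa_modified
```

The second return value corresponds to the test in step
iii): the WHEEL has to complete a revolution only when the
memory version of the VGSA was modified.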



ADDING A PV TO AN EXISTING VG 

When a PV is added to the VG a VGSA on the PV
must be added to the WHEEL. Since a PV that is
being added cannot have any active PPs the
activation of the VGSA becomes easy. The only real
concern is disaster recovery and even that is simplified.

i) The configuration routines must initialize the
disk VGSA that will be activated. The configuration
routines have two options: they can lay
down a VGSA with a content of binary zeros or
they can get the current image of the memory
version of the VGSA via the IOCTL. The only critical
issue is that the timestamps must be zero to
ensure that this new VGSA will not be used by
varyonvg if the system crashes before adding the
PV is complete.

ii) Get control of the WHEEL. That means if it is
rolling, stop it. If it is not rolling, inhibit it from
starting.

iii) Disable to INTIODONE. Insert physical
volume structure into volume group structure.

IF the WHEEL was rolling THEN make it rotate at
least back to the position just added. This may
cause some extra writes to PVs that already have
current VGSAs, but this will be so infrequent it
should not cause any noticeable delays.
IF the WHEEL was not rolling THEN re-position
the WHEEL controls to the position just before the
newly added position. This is so we won't spin the
WHEEL one whole revolution. The controls
should be set up to make the WHEEL believe the
new position is the last position to be written on
this revolution. This way only the new VGSA is
written and all the others currently on the WHEEL
are not rewritten with the same data they already
have. Since the memory version of the VGSA has
not changed due to the addition it is only
important that the current version be written to the new
disk. It is not important to rewrite the same
information on all the other disk VGSAs.

iv) Re-enable interrupt level.
Start/re-start the WHEEL.

NOTE: When the WHEEL stops or the request
from the configuration routines gets off the
WHEEL the VGSA is now active and will be
updated if a PP changes state. It is assumed the
VGDAs will be written sometime after the VGSA
is activated. Even if the writing of the VGDA on the
new PV fails the VGSA will remain active unless
there is a defined mechanism to come back down
into the kernel part of LVM and remove it.

v) Increment the quorum count in the volume 
group structure. 


DELETING A PV FROM AN EXISTING VG 

Deleting a VGSA from the WHEEL is probably the
simplest operation of them all. This is due to the fact
the PV has no active PPs.

i) The VGDAs should be updated to indicate the 
PV is no longer in the VG. 

ii) Get control of the WHEEL. That means if it is
rolling, stop it. If it is not rolling, inhibit it from
starting.

iii) Disable to INTIODONE. Check the position of
the WHEEL. If it is resting on the position to be
removed then advance the controls and remove
the physical volume structure from the volume
group structure. If the WHEEL is rolling and the
next position to be written is the position to be
removed, adjust the WHEEL controls to skip it,
then remove the physical volume structure from
the volume group structure. If the WHEEL was not
rolling or the position was not in any of the
described situations then just remove the physical
volume structure from the volume group
structure.

iv) If the WHEEL was rolling then restart it. If it was
not rolling then remove the inhibit.

There is no need to wait for one revolution of the 
WHEEL since no information in the VGSA has 
changed. This same procedure should be followed 
when deleting a PV with a status of missing. 

REACTIVATING A PV 

Reactivating a PV really means a defined PV is 
changing states from missing to active. This will hap- 
pen when a PV is returned or by a re-varyonvg oper- 
ation. The same procedure that is used for adding a 
PV can be used here. It is mentioned here just to rec- 
ognize that the condition exists and that it needs no 
special processing, outside the defined add PV pro- 
cedure, to reactivate the VGSA. 

VARYING ON A VG 

Varying on a VG is really a configuration and 
recovery operation as far as the WHEEL is concer- 
ned. Both of these are discussed later. But it is import- 
ant to note at this point that the WHEEL should not 
become active until a regular LV has been opened in 
the VG. This means the WHEEL does not become 
active until after the varyonvg operation is complete. 

VARYING OFF A VG 

There is only one way to varyoff a VG but there
are two modes, normal and forced. The only real
difference between them should be that the forced mode
sets the VGFORCED flag. This flag tells the driver this
VG is being forcefully shut down. A force off will stop
any new I/O from being started. In addition the
WHEEL will stop, if it is rolling, at the completion of the
next VGSA write and return all requests on the
WHEEL with errors. If it is not rolling it will be inhibited
from starting. The same procedure should be followed
for a normal varyoff but it should not encounter any
problems. This is because a normal varyoff enforces
a NO OPEN LV strategy in the VG before it continues.
So, if there are no open LVs in the VG there can be
no I/O in the VG. If there is no I/O in the VG the
WHEEL cannot be rolling. Only one procedure has
been designed to handle both instances.

i) If this is a normal varyoff then enforce NO OPEN
LVs. If this is a force off then set the VGFORCED
flag.

ii) Quiesce the VG. This will wait until all currently
active requests have been returned. This really
only applies to the force off mode since it may
have I/O currently active in the VG.
NOTE: During this time any write requests that have
failures in any mirrored LP partition will have to be
returned with an error, even if one partition worked
correctly. This is because the VGSA cannot be
updated to indicate a PP is now stale. Because the VG
is being forced off the mirror write consistency cache
(described hereafter) has been frozen just like the
VGSA. Therefore, the disk versions of the mirror write
consistency cache remember that this write was
active. Now, when the VG is varied back on, the mirror
write consistency recovery operation will attempt to
resynchronize any LTG that had a write outstanding
when the VG was forced off. Since a mirror write
consistency recovery operation just chooses a mirror to
make the master, it may pick the one that failed at the
time of the forced varyoff. If this is so, and it is
readable, the data in that target area of the write will revert
back to the state it was before the write. Therefore, an
error is returned for a logical request that gets an error
in any of its respective physical operations when the
VG is being forced off and the VGSA cannot be
updated to indicate a PP is now stale. See the
discussion on mirror write consistency for more details
concerning correctness versus consistency.

iii) The driver hierarchy for this VG can now be 
removed and the system resources returned to 
the system. 

There are just a few more areas yet to cover
concerning the VGSA. They are initial configuration,
VGSA recovery, and, finally, a quorum of VGSAs.
Initial configuration will be covered first.

The driver assumes the configuration routines will
allocate memory for the memory copy of the VGSA
and put a pointer to it in the volume group structure.
The configuration routines will select a valid VGSA
and load an image of the selected VGSA into that
memory VGSA before any mirrored LVs are opened.
In addition, there are several other fields in the volume
group structure that will need to be initialized since
there is a reserved buf structure and pbuf structure
embedded in the volume group structure. These
structures are reserved for VGSA I/O operations only.
This guarantees that there is a buf structure used by
logical requests and a pbuf structure used by physical
requests for any VGSA operation and therefore
eliminates deadlock conditions possible if no pbuf
structures were available from the general pool. The
configuration routines also control what PVs have
active VGSAs and where they are located on the PV.
These fields are in the physical volume structure and
must be set up also.

The next topic concerning the VGSA is its
recovery and/or validity at VG varyon time. It is the
configuration routines' responsibility to select a VGSA from
the ones available on the PVs in the VG. This
selection process is exactly like the selection process for
selecting a valid VGDA. It uses the timestamps of Fig.
7 that are present at the beginning and end of the
VGSA. Each time the memory version of the VGSA is
changed, those timestamps are updated to reflect the
system time. The configuration routines must select a
VGSA where the beginning and ending timestamps
match and they are later in time than any other VGSA
available. It goes without saying that the entire VGSA
must be read without errors. Once a VGSA is selected
the configuration routines must use the state flags to
initialize the driver partition structures. If the
configuration routines find a VGSA that is out of date, relative
to the selected one, or has read errors the
configuration routines will rewrite (recover) the VGSA before
the VG is allowed to have normal I/O activity. If the
VGSA cannot be rewritten without errors the PV must
not be used and must be declared missing.
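
The selection rule can be illustrated with a short
sketch. The function name select_vgsa and the tuple layout
are hypothetical. Treating a copy whose two timestamps
disagree as one that needs rewriting is an assumption; the
text only names out-of-date copies and copies with read
errors.

```python
def select_vgsa(copies):
    """Pick the VGSA to load at varyon time.

    copies is a list of (pv_name, begin_ts, end_ts, read_ok) tuples.
    Returns (selected pv_name or None, list of pv_names to rewrite).
    """
    # A copy is usable only if it was read cleanly and its timestamps match.
    valid = [c for c in copies if c[3] and c[1] == c[2]]
    if not valid:
        return None, [c[0] for c in copies]
    newest = max(valid, key=lambda c: c[1])
    # Everything that is unreadable, torn, or older than the selection
    # must be rewritten before normal I/O is allowed.
    rewrite = [c[0] for c in copies
               if not (c[3] and c[1] == c[2] == newest[1])]
    return newest[0], rewrite
```

A PV whose VGSA cannot be rewritten afterwards would then
be declared missing, as described above.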

The last VGSA issue to address is the quorum of
VGSAs. Like the VGDAs there must be a quorum of
VGSAs for the volume group to stay online. If a VGSA
write fails, the PV is declared missing, and all active
VGSAs on that PV are therefore missing also. At this
time a count is made of all currently active VGSAs and
if that count is below the quorum count set up by the
configuration routines, the VG is forcefully taken
offline. If the VG is forced offline all requests currently
on the WHEEL are returned with an error (EIO) if they
did not have an error already. The WHEEL is stopped
and will not accept any new requests. In order to
reactivate the VG it must be varied off and then varied back
online.
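
As a minimal sketch of the quorum check, assuming a
hypothetical function after_vgsa_write_failure and a
mapping from PV name to the number of active VGSAs on
that PV:

```python
def after_vgsa_write_failure(active_vgsas, failed_pv, quorum_count):
    """Drop the failed PV and re-count the active VGSAs.

    active_vgsas maps PV name -> number of active VGSAs on that PV.
    Returns (remaining mapping, True if the VG may stay online).
    """
    # Declaring the PV missing takes all of its VGSAs out of the count.
    remaining = {pv: n for pv, n in active_vgsas.items() if pv != failed_pv}
    stays_online = sum(remaining.values()) >= quorum_count
    return remaining, stays_online
```

When the second value is False the VG would be forced
offline and every request on the WHEEL returned with an
error.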

For the enabling code which implements the 
WHEEL see Fig. 21. 

MIRROR WRITE CONSISTENCY 

By far the largest problem with any system that
has multiple copies of the same data in different
places is ensuring that those copies are mirror images
of each other. In the preferred embodiment, with LVM
there can be up to three copies of the same data
stretched across one, two or even three physical
volumes. So when any particular logical write is
started it is almost guaranteed that from that point on the
respective underlying copies will be inconsistent with
each other if the system crashes before all copies
have been written. Unfortunately, there is just no way
to circumvent this problem given the nature of
asynchronous disk operations. Fortunately, all is not
lost since LVM does not return the logical request until
all the underlying physical operations are complete.
This includes any bad block relocation or stale PP
processing. Therefore, the user cannot assume any
particular write was successful until that write request is
returned to him without any error flags on. Then, and
only then, can that user assume a read will read the
data that was just written. What that means is, LVM
will concentrate on data consistency between mirrors
and not data correctness. Which in turn means, upon
recovery after a system crash any data a logical write
was writing when the system went down may or may
not be reflected in the physical copies of the LP. LVM
does guarantee that after a system crash the data
between all active PPs of a LP will be consistent. It may
be the old data or it may be the new data, but all copies
will contain the same data. This is referred to as Mirror
Write Consistency or MWC.

There is one restriction on the guarantee of
consistency. The volume group cannot have been
brought online without a quorum. The user has the
ability to force a VG online even if a quorum of VGDAs
and VGSAs are not available. If this forced quorum is
used then the user accepts the fact that there may be
data inconsistencies between copies of a LP.

Since the PPs may not be stale the normal resync
could not be used. Alternatively, a simple function to
read from the LP followed directly by a write to the
same logical address would be sufficient to make all
copies consistent. It could run in the background or
foreground, but in either case it would be time
consuming.

Mirror write consistency is accomplished by
remembering that a write has started and where it is
writing to. It is very critical to remember that the write
was started and where it was writing but less critical
when it completes. This information is remembered in
the mirror write consistency cache or MWCC. So, if
the system crashes, recovery of the PPs within the
LPs being written becomes a function of interpreting
the entries in the MWCC and issuing a mirror write
consistency recovery (MWCR) I/O operation through
the LV character device node to the affected area of
the LP. These MWCR operations must be done
before the LV is available for general I/O. The details of
MWCC will now be described.

There is one MWCC per VG and it is made up of
two parts. The first part, sometimes referred to as the
disk part and sometimes just part 1, is the part that is
written to the physical volumes. Therefore it is the part
that is used to control the MWCR operations during
recovery. Details of part 1 are discussed later. The
second half of the MWCC is the memory part or part 2.
Part 2 of the MWCC is memory resident only. It comes
into being when the VG is brought online. There are
many aspects to controlling a cache such as hashing,
ordering, freeing entries, etc., that have nothing to do
with the recovery of the data in the mirrors and
therefore do not need to be written to disk or permanent
storage. That is why there are two parts to the MWCC.
Part 1 or the disk part is written to the disk while part
2 is not. Each PV holds one copy of the MWCC. A
more detailed breakdown of each part of the MWCC
follows.

PART 1 - DISK PART 

Part 1 of the MWCC is 512 bytes long, a disk 
block. PV disk block 2 is reserved for the copy. Part 
1 has 3 basic parts. 

i) A beginning timestamp 

ii) Cache entries 

iii) An ending timestamp

The timestamps are used during recovery to
validate the MWCC and select the latest copy from all the
available PVs in the VG. The timestamps are 8 bytes
long each. There are 62 cache entries between the
timestamps though they might not all be actively being
used by the VG. The size of the active cache is
variable between 1 and 62 entries. The size of the active
cache is directly proportional to the length of time it will
take to recover the VG after a system crash. This
recovery time will be discussed later. The system is
currently implemented to use 32 cache entries.
Alternate embodiments could provide a command line
option so it is tuneable.

Each part 1 cache entry has 2 fields. 

i) Logical Track Group(LTG) number 

The cache line size is a LTG or 128K bytes. It is 
aligned to LTG boundaries. For example, if the 
number of active cache entries in the MWCC 
were 32, there could be no more than 32 different 
LTGs being written to at any point in time in the 
VG. 

ii) LV mirror number 

The mirror number of the LV that the LTG belongs 
in. 
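
The sizes given above imply 8 bytes per cache entry,
since (512 - 2 x 8) / 62 = 8. The sketch below packs and
validates such a block. The split of each entry into a
4-byte LTG number, a 2-byte LV number and 2 pad bytes, the
big-endian byte order, the zero filling of unused entries,
and the names pack_part1 and unpack_part1 are all
assumptions made for illustration; the text does not give
field widths.

```python
import struct

BLOCK_SIZE = 512                    # one disk block, per the text
N_ENTRIES = 62                      # cache entries between the timestamps
TIMESTAMP = struct.Struct(">Q")     # 8-byte timestamp
ENTRY = struct.Struct(">IHxx")      # assumed: LTG number, LV number, 2 pad bytes

# The layout must fill the block exactly: 8 + 62 * 8 + 8 = 512.
assert 2 * TIMESTAMP.size + N_ENTRIES * ENTRY.size == BLOCK_SIZE


def pack_part1(timestamp, entries):
    """Build the 512-byte disk part from (ltg_number, lv_number) pairs."""
    if len(entries) > N_ENTRIES:
        raise ValueError("at most 62 cache entries fit in part 1")
    body = b"".join(ENTRY.pack(ltg, lv) for ltg, lv in entries)
    body += bytes(ENTRY.size * (N_ENTRIES - len(entries)))   # unused entries
    stamp = TIMESTAMP.pack(timestamp)
    return stamp + body + stamp


def unpack_part1(block):
    """Return (timestamp, entries), or None if the two timestamps differ."""
    (begin,) = TIMESTAMP.unpack_from(block, 0)
    (end,) = TIMESTAMP.unpack_from(block, BLOCK_SIZE - TIMESTAMP.size)
    if begin != end:
        return None                  # torn or invalid copy
    entries = [ENTRY.unpack_from(block, TIMESTAMP.size + i * ENTRY.size)
               for i in range(N_ENTRIES)]
    return begin, entries
```

Writing the same timestamp at both ends is what lets
recovery reject a copy whose write did not complete.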

PART 2 - MEMORY PART

Each part 2 entry of the MWCC is made of several
fields. Since part 2 is memory resident its size is not
important here. It is important to know that there is a
direct one to one correspondence with the cache
entries of part 1. Therefore if there are 32 cache
entries being used in part 1, part 2 has 32 entries also.

i) Hash queue pointer 

Pointer to the next cache entry on this hash 
queue. Currently 8 hash queue anchors exist in 
the volume group structure. 

ii) State flags

NO CHANGE - Cache entry has NOT changed
since last cache write operation.
CHANGED - Cache entry has changed since last
cache write operation.
CLEAN - Entry has not been used since the last
clean up operation.

iii) Pointer to the corresponding part 1 entry

Pointer to the part 1 entry that corresponds to this
part 2 entry.

iv) I/O count

A count of the number of active I/O requests in the
LTG that this cache entry represents.

v) Pointer to next part 2 entry

Pointer to the next part 2 entry on the chain.

vi) Pointer to previous part 2 entry

Pointer to the previous part 2 entry on the chain.

It is important to know that there are two parts to
each cache entry, but, from this point a reference to
a cache entry means the entity formed by an entry
from part 1 and the corresponding entry from part 2.

The concept of the MWCC is deceptively simple,
but the implementation and recovery are not. Part of
this complexity is caused by the fact that a VG can be
brought online without all of its PVs. In fact, after a
system crash the VG can be brought online without
PVs that were there when the system went down.

There are two major areas of discussion
concerning the MWCC. There is the function of maintaining,
updating, and writing the cache as the driver receives
requests from the various system components. This
is the front side of the operation. It is done so it is
known what LTGs may be inconsistent at any one
point in time. Then there is the back side of the
operation. That is when there has been a system crash or
non-orderly shutdown and the MWC caches that
reside on the PVs must be used to make things
consistent again. So for now the focus of this discussion
will be on the front side of the operation, which also
happens to be the first step.

The driver will allocate memory and initialize it for
both parts of the cache when the first LV is opened in
the VG. The driver assumes the MWCCs that reside on
the disks have already been used by the recovery
process to make LTGs consistent and that those disk
blocks can be written over without loss of data. The
MWCR (mirror write consistency recovery) operation is
really a read followed by writes. Since the MWCC is
watching for writes the MWCR operations done at
varyon time slip by without modifying the disk copy of
the MWCC.

As MWCC is an entity that must be managed as
requests are received there is a Mirror Write
Consistency Manager (MWCM). The MWCM sits logically at
the top of the scheduler layer between the scheduler
and the strategy layer. It does not have a whole layer
by itself since its only concern is with mirrored
partition requests but it is easier to understand if you view
it there.

As the initial request policies receive requests
they will make some initial checks to see if the request
should be handed off to the MWCM. Not every
request that is handed off to the MWCM is
cached. The term cached is used rather loosely here.
A request is not cached in the classical sense but in
the sense that it must wait for information concerning its
operation to be written out to permanent storage
before the request is allowed to continue. So the MWCM
may return the request to the policy to indicate this
request may proceed. Following is a list
of conditions that would cause a request not to be
handed off to the MWCM.

i) The request is a read. 

ii) The LV options has the NO mirror write consis- 
tency flag turned on. 

iii) The request specifically requests NO mirror 
write consistency via the ext parameter. 

iv) There is only one active partition in this LP and
no resync is in progress in the partition. That
could mean there is only one copy or all the others
are stale.
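
The four conditions amount to a simple predicate. The
function name and parameter names below are hypothetical
and only mirror the list above.

```python
def hand_off_to_mwcm(is_read, lv_no_mwc_flag, ext_no_mwc,
                     active_partitions, resync_in_progress):
    """Return True if the initial request policy should involve the MWCM."""
    if is_read:                                   # i) reads are never cached
        return False
    if lv_no_mwc_flag:                            # ii) LV option disables MWC
        return False
    if ext_no_mwc:                                # iii) request opts out via ext
        return False
    if active_partitions == 1 and not resync_in_progress:
        return False                              # iv) nothing to keep consistent
    return True
```

A write to an LP with a single active partition is still
handed off while a resync is in progress in that partition,
because the fourth condition requires both parts to hold.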

As mentioned earlier, each PV has a block
reserved for the MWCC. But also mentioned was the fact
that they may all be different. The memory image of
the MWCC is a global view of the LTGs in the VG that
have writes currently active. But, the disk copies of the
MWCC are really only concerned with the information
in the cache that concerns writes to LPs that have PPs
on that PV. For any given logical write request the
MWCC will be written to the PVs where the PPs are
located for that logical write request.

As an example, if a LP has 3 copies, each on a
different PV, a copy of the MWCC will be written to
each PV before the actual data write is started. All the
writes to the disk copies of the MWCC will be done in
parallel. Even if one of the PPs is stale the MWCC on
that disk must be written before the write is allowed to
proceed on the active partitions. The MWCC must be
written to a PV with a stale mirror in case the PVs that
contain the active mirrors are found to be missing
during a varyon. Of course if a PV is missing the MWCC
cannot be written to it. This is more of a recovery
issue and will be discussed later.

Once the MWCM receives a request there are
several other tests that have to be made before the final
disposition of the request. The MWCM will do one of
the following for each request:
NOTE: Remember these decisions are based on the
cache line size of a LTG, or 128K, and not the size of
the physical partition. Also, the term cache here deals
exclusively with the memory version and is therefore
global to any LV in the VG.

IF (the target LTG is not in the cache) OR (the
target LTG is in the cache AND it is changing) THEN

i) Modify the cache - either add it to the cache or
bump the I/O count

ii) Move the cache entry to the head of the used
list

iii) Initiate writing the cache to the PVs if needed

iv) Put the request on the queues waiting for
cache writes to complete

v) When all cache writes complete for this request
then return the request to the scheduling policy

IF (the target LTG is in the cache AND it is not
changing) THEN

i) Bump the I/O count

ii) Move the cache entry to the head of the used
list

iii) Return the request to the scheduling policy
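
The two tests above, together with the rule that a request must be held while the MWCC is in flight, can be sketched as follows. This is a simplified illustration under stated assumptions: the names `sketch_entry`, `sketch_action` and `sketch_dispose` are hypothetical, the used-list reordering and the actual queueing are omitted, and the full-cache case is not modelled.

```c
#include <stdbool.h>

enum sketch_action {
    SKETCH_HOLD,              /* park on the hold queue; MWCC is in flight */
    SKETCH_WAIT_CACHE_WRITE,  /* wait for the MWCC writes to complete */
    SKETCH_PROCEED            /* return to the scheduling policy now */
};

/* Hypothetical in-memory cache entry for one LTG. */
struct sketch_entry {
    bool in_cache;   /* the LTG has an entry in the cache */
    bool changing;   /* cache writes for this entry are not yet complete */
    int  io_count;   /* writes currently active in this LTG */
};

enum sketch_action sketch_dispose(struct sketch_entry *e, bool mwcc_in_flight)
{
    if (!e->in_cache || e->changing) {
        /* The memory image must not be modified while it is in flight. */
        if (mwcc_in_flight)
            return SKETCH_HOLD;
        if (!e->in_cache) {
            e->in_cache = true;   /* add the LTG to the cache */
            e->io_count = 1;
        } else {
            e->io_count++;        /* bump the I/O count */
        }
        e->changing = true;       /* disk copies must now be written */
        return SKETCH_WAIT_CACHE_WRITE;
    }
    e->io_count++;                /* cache hit, disk copies are current */
    return SKETCH_PROCEED;
}
```

The point of the second branch is that a write into an LTG already recorded on disk costs no extra MWCC write; only the I/O count changes.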

There are some exceptions to the above logic,
however. Since the cache is a finite entity it is possible
to fill it up. When that happens the request must go to
a holding queue until a cache entry is available. Due
to the asynchronous nature of the driver and the
lengthiness of disk I/O operations, which include MWCC
writes, a special feature was added to the disk drivers
to help, but not eliminate, the problem. This feature
allows the driver to tell the disk drivers not to HIDE the
page. This means the driver can reference the MWCC
even if the hardware is currently getting data from
memory. Because of this the driver must take care to
maintain hardware memory cache coherency while
the MWCC is in flight to any PV.

Therefore, in the first test, if either condition is true
and the MWCC is in flight, i.e. being written, the request
will have to go to a holding queue until the MWCC is
no longer in flight. When the last MWCC write
completes, i.e. the MWCC is no longer in flight, the
requests on this holding queue can proceed through
the MWCM. Remember that the hardware will
transfer the MWCC data to the adapter hardware buffer
long before the actual write takes place. If the
information in the cache is changed after this hardware
transfer and before receiving an acknowledgment that
the data has been written, then a window exists where
what is acknowledged to be on the disk is different
from what is really there. In this case, if the request
continues, the disk version of the MWCC may or may
not know that this write is active. This uncertainty
cannot be allowed.

In the first test, if the second condition is true and
the MWCC is in flight, some might wonder why
not just bump the I/O count and put the request on a
queue waiting for cache writes. This condition comes
about because an earlier request has caused the
entry to be put in the cache and has started the
cache writes, but not all of them have completed, as
indicated by the changing state still being active. The
problem is that when the first request started the
cache writes, an association was made between it
and all the PVs that needed to have their caches
updated. At the point in time the second request
enters the MWCM there is no way to know how many,
if any, of these cache writes are complete. Therefore,
it is not known how many associations to make for this
second request so that it can proceed when those
cache writes are complete. So, this request is put on
the cache hold queue also. This does not cause the
request to lose much time because when all the cache
writes are complete this hold queue is looked at again.
When this is done, this request will get a cache hit and
proceed immediately to be scheduled.

In the above 2 test conditions, a statement
indicated the cache entry would be moved to the head of
the list. As with all cache systems, there is an
algorithm for handling the things in the cache. The
MWCC uses a Most Recently Used/Least Recently
Used (MRU/LRU) algorithm. When the MWCC is
initialized the cache entries are linked together via
pointers in part 2. These pointers refer to the next and
previous cache entry. This is a doubly linked circular
list with an anchor in the volume group structure. The
anchor points to the first cache entry, i.e. the most
recently used or modified. That entry's next pointer
points to the entry modified before the first one. This
goes on until you get to the last entry in the cache, the
least recently used or modified, and its next pointer
points back to the first entry, i.e. the same one the
anchor points to. The previous pointer does the
same thing but in reverse. So, the first entry's
previous pointer points to the last entry in the list, i.e. the
least recently used entry.
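
The list discipline just described can be sketched as follows. This is an illustrative sketch, not the driver's code: `sketch_node`, `sketch_init` and `sketch_touch` are hypothetical names, and the anchor is modelled as a plain pointer returned to the caller rather than a field of the volume group structure.

```c
#include <stddef.h>

/* Hypothetical cache entry; only the list linkage is modelled. */
struct sketch_node {
    struct sketch_node *next;  /* toward less recently used entries */
    struct sketch_node *prev;  /* toward more recently used entries */
    int io_count;
};

/* Link n entries into a doubly linked circle; returns the anchor (MRU). */
struct sketch_node *sketch_init(struct sketch_node *e, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        e[i].next = &e[(i + 1) % n];
        e[i].prev = &e[(i + n - 1) % n];
        e[i].io_count = 0;
    }
    return n ? &e[0] : NULL;
}

/* Move entry e to the head of the used list; returns the new anchor. */
struct sketch_node *sketch_touch(struct sketch_node *anchor,
                                 struct sketch_node *e)
{
    if (e == anchor)
        return anchor;
    /* Unlink e from its current position. */
    e->prev->next = e->next;
    e->next->prev = e->prev;
    /* Re-insert e between the LRU entry and the old head. */
    e->prev = anchor->prev;
    e->next = anchor;
    anchor->prev->next = e;
    anchor->prev = e;
    return e;
}
```

Because the list is circular, the LRU entry is always reachable as the anchor's previous pointer, so no separate tail pointer is needed.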

By using this type of mechanism several things
are gained. First, there is no free list. When a cache
entry is needed the last one (LRU) on the list is taken.
If its I/O count is non-zero, then scan the cache
entries, via the LRU chain, to find an entry with a zero
I/O count. If none are found the cache is full. This
eliminates the need for counters to maintain the
number of entries currently in use.

Note, however, that when the I/O count is non-ze- 
ro in the LRU entry, the cache cannot be assumed to 
be full. Although the LRU entry is known to have been 
in the cache the longest, that is all that is known. If a 
system has multiple disk adapters or the disk drivers 
do head optimizations on their queues the requests 
may come back in any order. Therefore, a request that 
may have been started after the LRU entry could fin- 
ish before the LRU requests, thereby making a cache 
entry available in the middle of the LRU chain. It is 
therefore desirable to have a variable that would 
count the number of times a write request had to hold 
due to the cache being full. 
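
A minimal sketch of the allocation scan and of the suggested "cache full" counter follows. The names `sketch_node`, `sketch_find_free` and `sketch_full_holds` are hypothetical; the scan simply walks the previous pointers starting from the LRU entry.

```c
#include <stddef.h>

/* Hypothetical cache entry; only what the scan needs is modelled. */
struct sketch_node {
    struct sketch_node *next;  /* toward less recently used entries */
    struct sketch_node *prev;  /* toward more recently used entries */
    int io_count;              /* writes still active in this LTG */
};

/* Number of times a write request had to hold because the cache was full. */
unsigned long sketch_full_holds = 0;

/*
 * Starting at the LRU entry (the anchor's previous entry), walk the chain
 * toward the MRU end and return the first entry with a zero I/O count.
 * Returns NULL, and counts the event, when every entry is busy.
 */
struct sketch_node *sketch_find_free(struct sketch_node *anchor)
{
    struct sketch_node *lru, *e;

    if (anchor == NULL)
        return NULL;
    lru = anchor->prev;
    e = lru;
    do {
        if (e->io_count == 0)
            return e;
        e = e->prev;
    } while (e != lru);
    sketch_full_holds++;
    return NULL;
}
```

The scan is needed precisely because completions can arrive out of order: a busy LRU entry says nothing about the entries in the middle of the chain.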

If the MWCM is scanning the hold queue when the 
cache fills up, the MWCM should continue to scan the 
hold queue looking for any requests that may be in the 
same LTG as any requests just added from the hold 
queue. If any are found they can be removed from the 
hold queue and moved to the cache write waiting 
queue after incrementing the appropriate I/O count.

As mentioned earlier, there are hash queue
anchors (8) in the volume group structure. In order to
reduce cache search time, the entries are hashed
onto these anchors by the LTG of the request. This
hashing is accomplished by methods commonly
known in the prior art. The entries on any particular
hash queue are forward linked via a hash pointer in
part 2 of the cache entry.
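
The text leaves the hash function to the prior art, so the sketch below simply assumes a modulo of the LTG number over the 8 anchors. The names `sketch_hentry`, `sketch_hash`, `sketch_hash_insert` and `sketch_hash_find` are hypothetical, as is the `lv` field used to tell apart LTGs of different logical volumes.

```c
#include <stddef.h>

#define SKETCH_HASH_ANCHORS 8   /* the text mentions 8 hash queue anchors */

/* Hypothetical cache entry; only the hash linkage is modelled. */
struct sketch_hentry {
    unsigned int lv;                  /* assumed LV identifier */
    unsigned long ltg;                /* logical track group number */
    struct sketch_hentry *hash_next;  /* forward link on the hash queue */
};

struct sketch_hash {
    struct sketch_hentry *anchor[SKETCH_HASH_ANCHORS];
};

/* Assumed hash: the LTG number modulo the number of anchors. */
static unsigned int sketch_slot(unsigned long ltg)
{
    return (unsigned int)(ltg % SKETCH_HASH_ANCHORS);
}

/* Push an entry onto the front of its hash queue. */
void sketch_hash_insert(struct sketch_hash *h, struct sketch_hentry *e)
{
    unsigned int s = sketch_slot(e->ltg);

    e->hash_next = h->anchor[s];
    h->anchor[s] = e;
}

/* Follow the forward links of one hash queue looking for (lv, ltg). */
struct sketch_hentry *sketch_hash_find(const struct sketch_hash *h,
                                       unsigned int lv, unsigned long ltg)
{
    struct sketch_hentry *e;

    for (e = h->anchor[sketch_slot(ltg)]; e != NULL; e = e->hash_next)
        if (e->lv == lv && e->ltg == ltg)
            return e;
    return NULL;
}
```

With this assumed hash, LTG 3 and LTG 11 share a queue, so a lookup walks at most the entries on that one queue instead of the whole cache.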

There are certain times when a cache cleanup
operation should be done. The most obvious is at LV
close time. At that time the cache should be scanned
for entries with a zero I/O count. When an entry is
found, it should be cleared and moved to the end of
the LRU chain. Once the entire cache has been
scanned, the MWCC on the PVs to which the cleared
entries belong should also be written. Another time for
a cache cleanup operation might be at the request of
system management via an IOCTL.

One other thing deserves to be mentioned here.
What if the MWCC block on the disk goes bad due to
a media defect? The MWCM will attempt to do a
hardware relocation of that block if this condition is
found. If that relocation fails, or a non-media type
error is encountered on a MWCC write, the PV is
declared missing.

MWCC RECOVERY

We now know how the front side of the MWCC
works. Remember the whole purpose of the MWCC is
to leave enough bread crumbs lying around so that in
the event of a system crash the mirrored LTGs that
had write requests active can be found and made
consistent. This discussion will focus on the backside of
the MWCC, or the recovery issues.

Recovery will be done only with the initial varyon
operation. This is due to the need to inhibit normal
user I/O in the VG while the recovery operations are
in progress.

The recovery operations are the very last phase
of the VG varyon operation. This is because the entire
VG must be up and configured into the kernel before
any I/O can take place, even in recovery operations,
where care must be taken not to allow normal I/O in
the VG until all the LTGs in flight have been made
consistent.

The first step in the recovery process is selecting
the latest MWCC from all the PVs available. Once this
is done, the recovery of the LTGs in the selected
MWCC becomes a simple task of issuing mirror write
consistency recovery (MWCR) I/O requests to the
LVs/LPs/LTGs that have an entry in the cache. This
method is referred to as the fast path method
because the maximum number of recovery
operations is limited to the size of the MWCC. This in effect
sets the maximum recovery time for the VG.
In other words, using the selected MWCC, do recovery
on the LTG(s) if the parent LP has more than one
non-stale PP copy.

During these MWCR requests, if a mirror has a
write failure or is on a missing PV it will be marked
stale by the driver, via the WHEEL.

Missing PVs do not add time to the recovery
operation at varyon time, but they may make mirrors stale that
will need to be resynchronized later when the PVs 
come back online. There are 3 types of missing PVs. 
The first type is previously missing PVs, which are the 
PVs that are marked as missing in the VGSA. These 
previously missing PVs could have been missing at 
the last varyon, or the driver could have found and 
declared them missing while the VG was online. It 
makes no difference. The second type of missing PVs 
is the newly found missing PVs. These PVs were 
found by the varyon operation to be not available at 
this time, but the PV status in the VGSA indicates they 
were online the last time the VG was online. This
could be caused by a drive or adapter failure that
caused a loss of quorum, so that the VG was forcibly
taken offline. Another cause of newly found missing
PVs is when the PV was not included in the list of PVs 
to varyon when the varyonvg command was issued. 
There is one other way for a PV to fall into the newly 
found missing category, and that is when the MWCC 
cannot be read due to a read error of any kind, but the 
PV does respond to VGDA reads and writes. 

The previously missing PVs and the newly found 
missing PVs are combined into the final type of mis- 
sing PVs, the currently missing PVs. The currently 
missing PVs are the ones of importance to the current 
discussion. After the first phase of the recovery 
scenario, i.e. select a MWCC from the available PVs 
and do fast path recovery, the second phase is done. 
The second phase is done only if there are any cur- 
rently missing PVs. 

Actual recovery of LTGs on missing PVs may be
impossible for a couple of reasons. The biggest of
these reasons is that the PV is missing and therefore will
not accept any I/O. Another concern with missing PVs
arises when all copies of a LP/LTG are wholly contained
on the missing PVs. This is a problem because there
is no information about these LTGs available to the
recovery process. It is not known if write I/O was in
flight to these LPs/LTGs.

Therefore, it must be assumed there was I/O
outstanding, and the recovery process must do the right
thing to ensure data consistency when the PVs are
brought back online. The correct thing to do in this
case is mark all but one non-stale mirror stale, i.e. for
each LP in the VG, if the LP is wholly contained on the
currently missing PVs, then mark all but one of the
mirrors stale. When the PVs come back online the
affected LPs will have to be resynchronized.
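
The second recovery phase can be sketched as below. This is an illustration under stated assumptions: `sketch_lp`, `sketch_copy` and `sketch_mark_missing_lp` are hypothetical names, and the choice of which non-stale mirror to keep (here, the first one found) is an assumption, since the text only requires that exactly one remain non-stale.

```c
#include <stdbool.h>

#define SKETCH_MAX_COPIES 3   /* an LP has at most three mirrors */

/* Hypothetical description of one mirror (PP copy) of an LP. */
struct sketch_copy {
    int  pv;      /* index of the PV holding this copy */
    bool stale;   /* copy is already marked stale */
};

struct sketch_lp {
    int ncopies;
    struct sketch_copy copy[SKETCH_MAX_COPIES];
};

/*
 * If every copy of the LP lies on a currently missing PV, mark all but
 * one non-stale copy stale. Returns the number of copies newly marked.
 * pv_missing[pv] is true when that PV is currently missing.
 */
int sketch_mark_missing_lp(struct sketch_lp *lp, const bool *pv_missing)
{
    int i, marked = 0;
    bool kept = false;

    /* The rule applies only when the LP is wholly on missing PVs. */
    for (i = 0; i < lp->ncopies; i++)
        if (!pv_missing[lp->copy[i].pv])
            return 0;

    for (i = 0; i < lp->ncopies; i++) {
        if (lp->copy[i].stale)
            continue;
        if (!kept) {
            kept = true;          /* assumed: keep the first non-stale copy */
            continue;
        }
        lp->copy[i].stale = true;
        marked++;
    }
    return marked;
}
```

Keeping a single non-stale mirror gives the later resynchronization one unambiguous source of data, whatever writes may have been in flight when the PVs disappeared.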

A data storage system has been described which
has an improved system throughput.

A storage hierarchy for managing a data storage
system has also been described.

A method of improved system throughput in a
computer system where multiple copies of data are
stored to aid in error recovery has also been described.

Claims

1. A method for managing a plurality of data storage
devices associated with a computer system and
having a first physical volume and subsequent
physical volumes and being partitioned into one
or more logical volumes, each of said logical
volumes being further partitioned into one or more
logical partitions each of which comprises one or
more physical partitions of said storage devices,
said method comprising the steps of:

determining status information for each of
said physical partitions and recording said status
information in a memory of said computer system;

recording said status information in a
status area existing on each of said data storage
devices;

creating updated status information when
a write request is generated for any of said
physical partitions;

updating said status area on said first
physical volume with said updated status
information; and

updating said status area of each
subsequent physical volume within said storage
devices in succession with said updated status
information, wherein if a second or subsequent
write request is received prior to completing an
update of each of said storage device status
areas as a result of a prior write request, said
status information is updated in said computer
memory and used in updating said next
succeeding physical volume status area.

2. A method as claimed in claim 1 wherein each of
said physical partitions corresponding to a given
logical partition contains duplicate data
information.

3. A method as claimed in claim 2 wherein if a
subsequent request to change status information is
received prior to completing an update of each of
said storage device status areas as a result of a
prior status change, said subsequent status
change is recorded while recording status
information resulting from said prior status change.

4. A computer system including means for
managing a plurality of data storage devices associated
with said computer system and having a first
physical volume and subsequent physical
volumes and being partitioned into one or more
logical volumes, each of said logical volumes
being further partitioned into one or more logical
partitions each of which comprises one or more
physical partitions of said data storage devices,
said managing means comprising:

means for maintaining status information
for each of said physical partitions in a memory of
said computer system;

recording means for recording said status
information in a status area existing on each of
said data storage devices;

means for creating updated status
information when a write request is generated for any
of said physical partitions;

first update means for updating said status
area on said first physical volume with said
updated status information; and

subsequent update means for updating
said status area of each subsequent physical
volume within said data storage devices in
succession with said updated status information,
wherein if a second or subsequent write request
is received prior to completing an update of each
of said data storage device status areas as a
result of a prior write request, said status
information is updated in said computer memory and
used in updating said next succeeding physical
volume status area.

Flags - Basic information - READ/WRITE, Buffer Busy, Error
Indicator.

Pointers - Used to link these PBUFs onto various chains to 
control the flow. 

IODONE - PTR to a function to hand the PBUF to when request is 
complete, i. e. the lower level disk device drivers call the function 
pointed to by this field to return the request back to LVM when 
they are finished with the request. 

Device - Physical Device where the transfer will be done. 

Disk block # - Disk Block # where the transfer is to start from. 

Memory Address - Memory Address where the data is to be 
transferred to or from. 

Xfercount - Number of bytes to transfer. In the case of LVM this 
must be a multiple of disk blocks (512 bytes). 

Error Type - When the error indicator is on (True) in the flags
field, this field indicates the type of error, for example Media
Error or Invalid Request.

Residual Xfercount - If an error occurred on a Xfer, this field
contains the number of bytes that were NOT transferred.

PTR to Original Request - LVM receives requests from layers 
above the strategy layer. These logical requests are translated 
into one or more physical requests (PBUFs). When all physical 
requests for a given logical request are complete the logical 
requests can be returned to its originator. This is a backward link 
to the originating logical request. 

PTR to Scheduling Routine - Physical Requests are returned from
the disk drivers, via the IODONE field, to the physical
layer of LVM. The physical layer has responsibility for bad block
processing. If the request is finished the physical layer will return
the request to the scheduling layer via this pointer. The
scheduling layer makes decisions concerning what must be done
next to complete the logical request.

FIG. 17a

Mirror - Mirror number associated with this PBUF, 0, 1 , or 2. 

Mirror Avoid - Bit mask (3 bits) indicating which mirrors are to be
avoided or not used to satisfy the logical request, i.e. the mirror is
broken, or on a physical volume that is not available.

Mirror Bad - Bit mask (3 bits) indicating which mirrors have had
failures or are broken.

Mirror Done - Bit mask (3 bits) indicating which mirrors have
completed the transfer.

SW Retry - Software Retry count; how many times this block has 
had a software relocation attempted, 1 or 2. 

Type - Type of PBUF, used in error processing by the WHEEL 
and Bad Block Processing. Tells the WHEEL if this is a Make 
Stale PP Request, Mark PV Missing, or Make PP Fresh. 

Bad Block Operation - Used to control updating the Bad Block
directory that resides in the reserved area of all physical volumes.

WHEEL Stop - The position this PBUF is to get off of the WHEEL 
when it is on the WHEEL. 

/* VGSA.H */

#ifndef _H_VGSA
#define _H_VGSA

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - vgsa.h
 */

#include <sys/param.h>
#include <sys/dasd.h>

/*
 * LVDD internal macros and defines used by Volume Group Status
 * Area (VGSA) logic
 */

#define RTN_ERR     1   /* Return requests from the VGSA */
                        /* wheel with ENXIO errors */
#define RTN_NORM    0   /* Return requests from the VGSA */
                        /* wheel without explicitly */
                        /* turning on the B_ERROR flag */

#define VGSA_BLK    8                    /* VGSA length in disk blocks */
#define VGSA_SIZE   (VGSA_BLK * DBSIZE)  /* VGSA length in bytes */
#define VGSA_BT_PV  127                  /* VGSA bytes per PV */

/*
 * This structure limits the number of Physical Partitions (PP) that can be
 * present in the VG to 32,512. The stalepp portion is divided equally
 * between the 32 possible PVs of the VG. This gives each PV 127 bytes
 * or 1016 PPs.
 */

FIG. 21-1 
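
The capacity figures quoted in the header comment (1016 PPs per PV and 32,512 PPs per VG) follow directly from one stale bit per PP. The short sketch below reproduces that arithmetic and shows how a PP number maps to a byte and bit within a PV's 127-byte stale area. The `SKETCH_` constants and function names are hypothetical stand-ins for `MAXPVS`, `VGSA_BT_PV` and `NBPB`, and `CHAR_BIT` is assumed to be 8.

```c
#include <limits.h>

/* Stand-ins for MAXPVS and VGSA_BT_PV from the listing. */
enum { SKETCH_MAXPVS = 32, SKETCH_VGSA_BT_PV = 127 };

/* Position of one PP's stale bit inside a PV's stale area. */
struct sketch_bitpos {
    unsigned int  byte;  /* index into the per-PV byte array */
    unsigned char mask;  /* single-bit mask within that byte */
};

/* One bit per PP: bytes per PV times bits per byte. */
unsigned int sketch_pps_per_pv(void)
{
    return SKETCH_VGSA_BT_PV * CHAR_BIT;
}

/* Total PPs the VGSA can describe across all PVs of the VG. */
unsigned int sketch_max_pps(void)
{
    return SKETCH_MAXPVS * sketch_pps_per_pv();
}

/* Same byte/bit split the stale PP macros use: pp / NBPB and pp % NBPB. */
struct sketch_bitpos sketch_stale_bit(unsigned int pp)
{
    struct sketch_bitpos p;

    p.byte = pp / CHAR_BIT;
    p.mask = (unsigned char)(1u << (pp % CHAR_BIT));
    return p;
}
```

With 8-bit bytes, `sketch_pps_per_pv()` gives 1016 and `sketch_max_pps()` gives 32512, matching the comment; zero-relative PP 1015 lands in the last byte (index 126) of its PV's area.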
struct vgsa_area {
    struct timestruc_t b_tmstamp;   /* Beginning time stamp */
                                    /* Bit per PV */
    ulong pv_missing[(MAXPVS + (NBPL - 1)) / NBPL];
                                    /* Stale PP bits */
    uchar stalepp[MAXPVS][VGSA_BT_PV];
    char pad2[12];                  /* Padding */
    struct timestruc_t e_tmstamp;   /* Ending time stamp */
};

/*
 * Macros used to set/clear/test pv_missing and stalepp bits in a vgsa_area
 * struct. The ptr argument is assumed to be a ptr to the vgsa_area
 * structure. All other arguments are assumed to be zero relative.
 * This allows LVM library functions to use these macros.
 * NOTE these macros will not work if the max number of PVs per VG is
 * greater than 32.
 */

#define SETSA_PVMISS(Ptr, Pvnum) \
    ((Ptr)->pv_missing[(Pvnum)/NBPL] |= (1 << (Pvnum)))
#define CLRSA_PVMISS(Ptr, Pvnum) \
    ((Ptr)->pv_missing[(Pvnum)/NBPL] &= (~(1 << (Pvnum))))
#define TSTSA_PVMISS(Ptr, Pvnum) \
    ((Ptr)->pv_missing[(Pvnum)/NBPL] & (1 << (Pvnum)))

#define SETSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] |= (1 << ((Pp) % NBPB)))

#define CLRSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] &= (~(1 << ((Pp) % NBPB))))

#define XORSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] ^= (1 << ((Pp) % NBPB)))

FIG. 21-2 
#define TSTSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] & (1 << ((Pp) % NBPB)))

/*
 * Macros used to set/retrieve the logical sector number and sequence number
 * associated with each VGSA.
 */

#define GETSA_LSN(Vg, Idx) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].lsn)

#define SETSA_LSN(Vg, Idx, Newlsn) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].lsn = (Newlsn))

#define GETSA_SEQ(Vg, Idx) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].sa_seq_num)

#define SETSA_SEQ(Vg, Idx, Seq) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].sa_seq_num = (Seq))

#define NUKESA(Vg, Idx) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].nukesa)

#define SET_NUKESA(Vg, Idx, Flag) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].nukesa = (Flag))

/*
 * The following structures are used by the config routines to pass
 * information to the hd_sa_config() function for stale/fresh PP and
 * install/delete PV processing. A pointer to an array of these
 * structures is passed as an argument.
 */

/*
 * An array of these structures is terminated with both the pvnum and pp
 * equalling -1.
 */
struct cnfg_pp_state {
    short pvnum;        /* PV number the PP is on */
    short pp;           /* PP number to mark stale/fresh */
    int   ppstate;      /* state to mark PP stale/fresh */
};

/*
 * passed in as arg when a CNFG_EXT request is done.
 */
struct sa_ext {
    struct lvol *klv_ptr;        /* ptr to lvol struct being extended */
    short nparts;                /* number of copies of the lv */
    char  isched;                /* scheduling policy for the lv */
    char  res;                   /* padding */
    ulong nblocks;               /* length in blocks of the lv */
    struct part **new_parts;     /* ptr to new part struct list */
    int   old_numlps;            /* old number of logical partitions on lv */
    int   old_nparts;            /* previous number of partitions on lv */
    int   error;                 /* error to return to library layer */
    struct cnfg_pp_state *vgsa;  /* ptr to pp info structure */
};

/*
 * passed in as arg when a CNFG_RED request is done
 */
struct sa_red {
    struct lvol *lv;             /* ptr to lvol struct being reduced */
    short nparts;                /* number of copies of the lv */
    char  isched;                /* scheduling policy for the lv */
    char  res;                   /* reserved area */
    ulong nblocks;               /* length in blocks of the lv */
    struct part **newparts;      /* ptr to new part struct list */
    unsigned short min_num;      /* minor number of logical volume */
    int   numlps;                /* number of lps on lv after reduction */
    int   numred;                /* number of pps being reduced */

FIG. 21-4 
    int   error;                 /* error to return to library layer */
    struct extred_part *list;    /* list of pps to reduce */
};

/*
 * install PV information for VGSA config routine
 */
struct confg_pv_ins {
    struct pvol *pvol;      /* PV to install or remove */
    short qrmcnt;           /* new VG quorum count */
    short pv_idx;           /* index into vg's pvol array */
};

/*
 * delete PV information for VGSA config routine. Also used for remove PV
 * and missing PV. (qrmcnt will not be used for missing PV)
 */
struct cnfg_pv_del {
    struct pvol *pv_ptr;    /* pointer to pvol struct to remove */
    struct part *lp_ptr;    /* pointer to DALV's LP struct to zero */
    short lpsize;           /* size of DALV's LP */
    short qrmcnt;           /* VG's new quorum cnt once this PV is deleted */
};

/*
 * information to add/delete a VGSA from a PV
 */
struct confg_pv_vgsa {
    struct pvol *pv_ptr;    /* pointer to pvol struct to remove */
    daddr_t sa_lsns[2];     /* LSNs for VGSAs added or 0 if deleted or */
                            /* if a copy not being added */
    short qrmcnt;           /* VG's new quorum cnt once this PV is deleted */
};

/*
 * The following defines are used by the VGSA write operations. These
 * defines indicate what action the pbuf is requesting. It is stored

FIG. 21-5 
 * in the pb_type field of the pbuf.
 */

#define SA_PVMISSING    1   /* PV missing type pbuf */
#define SA_STALEPP      2   /* Stale PP type pbuf */
#define SA_FRESHPP      3   /* Fresh PP type pbuf */
#define SA_CONFIGOP     4   /* hd_config function type pbuf */
#define SA_PVREMOVED    5   /* PV removed type pbuf */

/*
 * The following defines are used by the config routines to set up
 * the cnfg_pp_state fields
 */

#define STALEPP         1
#define FRESHPP         0
#define CNFG_STOP       -1
#define CNFG_NEWCOPY    -1

#endif /* _H_VGSA */



/* LIBLVM.H
 * COMPONENT_NAME: (liblvm) Logical Volume Manager
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

#ifndef _H_LIBLVM
#define _H_LIBLVM

#include <lvm.h>
#include <sys/dasd.h>
#include <sys/bootrecord.h>

#ifndef TRUE

FIG. 21-6 






EP0 482 853 A2 



#define TRUE 1
#endif

#ifndef FALSE
#define FALSE 0
#endif

#ifndef NULL
#define NULL ((void *) 0)
#endif



/*
 * Error codes used internally by the library. These are not returned
 * to the user. NOTE that these values start at 500 so they will not
 * conflict with the error values in lvm.h which are returned to the
 * user.
 */

#define LVM_BBRDERR     -500   /* read error on bad block directory */
#define LVM_BBWRERR     -501   /* write error on bad block directory */
#define LVM_PVRDRELOC   -502   /* put PV in read only relocation */
#define LVM_BBINSANE    -503   /* bad block directory is not sane */



/*
 * General defines
 */

#define LOCKALL       0
#define CHECK_MAJ     1
#define NOCHECK       0
#define FIRST_INDEX   0
#define SEC_INDEX     1
#define THIRD_INDEX   2
#define NO_COPIES     0
#define ONE_COPY      1
#define LVM_FNAME     72
#define LVM_NOLPSYET    0
#define LVM_REDUCE      1
#define LVM_EXTEND      2
#define LVM_FIRST       1
#define LVM_SEC         2
#define LVM_THIRD       3
#define LVM_LASTPV      0
#define LVM_CASEGEN     1
#define LVM_CASE2TO1    2
#define LVM_CASE3TO2    3
#define LVM_GETSTALE    1
#define LVM_NOSTALE     2
#define LVM_PVNAME      1
#define LVM_VGNAME      2
#define LVM_LVDDNAME    "hd_pin"
#define LVM_KMIDFILE    "/etc/vg/lvdd_kmid"
#define LVM_STREAMRD    "r"
#define LVM_STREAMWR    "w"
#define MAPNOTOPEN      -1   /* Mapped file is not open */

/*
 * GENERAL LV VALUES
 */

#define LVM_INITIAL_LPNUM 0

#define LVM_LVMID       0x5F4C564D   /* LVM id field = "_LVM" */
#define LVM_SLASH       0x2F         /* hex value for ASCII slash */
#define LVM_NULLCHAR    '\0'         /* null character */
#define LVM_DEV         "/dev/"      /* concatenate to device names */

/*
#define LVM_EXTNAME (sizeof (LVM_NAMESIZ) + sizeof (LVM_DEV) + 1)
*/
                                     /* size of extended device names */
#define LVM_EXTNAME     72
#define LVM_ETCVG       "/etc/vg/vg"
                        /* concatenate to VG id for map filename */



#define LVM 
#define LVM 



REL0C_LEN 256 
RELOCMASK 0x8 



#define LVM 



FROMBEGIN 
NOVGDALSN 
FILENOTOPN 
1 



0 
-1 

-1 



0 



#define LVM 
define LVM" 
#define LVM 
#d@fin@ LVM 
#defineLVM" 
#define LVtf 
#define LVM" 



1 



#defin@ IVM 
#define LVM 
#define LVM 
#define LVM' 
#define LVM 
#define LVM 
#define LVM" 
#define LVM' 
#d@fine LVM' 
#define LVM' 
define LVM 



P length in blocks of BB reloc pool 7 

P mask value to check BB relocation 7 
P seek value is offset from beginning 7 
P no desc LSN defined for this entry 7 
P file is not currently open 7 
fwriteVGDAforthisPV7 
P minor number of descriptor area LV 7 
P index for primary VGDA/VGSA LSN 7 
P index for secondary VGDA/VGSA LSN 7 

P permissions for open of mapped file 7 
P PV number of first physical volume 7 
P write order of beginning/middle/end 7 
P write order of middle/beginning/end 7 
P zero end timestamp, then write b/m/e 7 
r first timestamp greater than second 7 
P first timestamp equals second 7 
P first timestamp less than second 7 
P read error on timestamp 7 
BTSEQETS LVM_EQUAL P begin timestamp = end ts 7 
BTSGTETS LVM GREATER P begin timestamp > end ts 7 



SCNDRY 
"MAPFPEI 
1STPV 1 



MIDBEGEND 
ZEROETS 



0 

1 



EQUAL 2 
"LESS 3 
TSRDERR 



#define LVM_DAPVS_TTL1   1   /* total of 1 PV with VGDA copies */
#define LVM_TTLDAS_1PV   2   /* total number VGDA copies on 1 PV */
#define LVM_DAPVS_TTL2   2   /* total of 2 PVs with VGDA copies */
#define LVM_TTLDAS_2PV   3   /* total number VGDA copies on 2 PVs */
#define LVM_DASPERPVGEN  1   /* number VGDAs per PV for general case */
#define LVM_BBCHGRBLK    1   /* change relocation block of bad block */
#define LVM_BBCHGSTAT    2   /* change status field of bad block */
#define LVM_STRCMPEQ     0   /* string compare result of equal */
#define LVM_BBRDONLY     1   /* read a bad block directory */
#define LVM_BBRDINIT     2   /* read and initialize a bad block directory */
#define LVM_BBRDRECV     3   /* read and recover bad block directories */
#define LVM_BBPRIM       1   /* use the primary bad block directory */
#define LVM_BBBACK       2   /* use the backup bad block directory */

/*
 * Macros
 */

#define LVM_SIZTOBBND(Size) ((((Size) + DBSIZE - 1) / DBSIZE) * DBSIZE)

#define LVM_BBDIRLEN(Bb_hdr) (LVM_SIZTOBBND(sizeof (struct bb_hdr) + \
    ((Bb_hdr)->num_entries * sizeof (struct bb_entry))))

#define LVM_MAPFN(Mapfn, Vgid) \
    (sprintf ((Mapfn), "%s%8.8X%8.8X", LVM_ETCVG, \
    (Vgid)->word1, (Vgid)->word2))

#define LVM JUILDNAME(name,maj,min) \ 
(sprirrtf((name);%sy^^^^ 

#define LVM_BUILDVGNAME(name,maj) \ 
(sprintf((name^^ 

#define LVM_PPLENBLKS(Ppsize) (1 << ((Ppsize) - DBSHIFT))

#define LVM_PSNFSTPP(Lvmareastart, Lvmarealen) \
    (TRK2BLK (BLK2TRK (Lvmareastart + Lvmarealen - 1) + 1))

/*
 * The following is the file header structure that gives indexes and
 * general information about the volume group descriptor area structures
 */

struct da_info {
    daddr_t dalsn;              /* logical sector number of VGDA copy */
    struct timestruc_t ts;      /* timestamp of this VGDA copy */
};

struct fheader {
    long vginx;                 /* byte offset for vg header */
    long lvinx;                 /* byte offset for lv entries */
    long pvinx;                 /* byte offset for pv entries */
    long endpvs;                /* byte offset for end of last PV entry */
    long name_inx;              /* offset for the name area */
    long trailinx;              /* byte offset for the vg trailer */
    long major_num;             /* major number of volume group */
    long vgda_len;              /* length in blocks of the VGDA */
    char vgname[LVM_NAMESIZ];   /* name of volume group */
    long quorum_cnt;            /* number of vgdas needed for varyon */
    long pad1;                  /* pad */
    short int num_desclps;      /* number of LPs per PV for the VGDA LV */
    struct pvinfo {
        char pvname[LVM_NAMESIZ];   /* PV name */
        struct unique_id pv_id;     /* id of physical volume */
        long pvinx;                 /* byte offset to PV header */
        dev_t device;               /* major/minor number */
        short pad2;                 /* pad */
        short pad3;                 /* pad */
        struct da_info da[LVM_PVMAXVGDAS];  /* info on this VGDA copy */
    } pvinfo[LVM_MAXPVS];       /* information about each PV */
};

/* II. Volume Group Descriptor Area */

struct vg_header {
    struct timestruc_t vg_timestamp;  /* time of last update */
    struct unique_id vg_id;     /* unique id for volume group */
    short numlvs;               /* number of lvs in vg */
    short maxlvs;               /* max number of lvs allowed in vg */
    short pp_size;              /* size of pps in the vg */
    short numpvs;               /* number of pvs in the vg */
    short total_vgdas;          /* number of copies of vg */
                                /* descriptor area on disk */
    short vgda_size;            /* size of volume group descriptor */
                                /* area */
};
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struct lv_entries 

' short Ivname; PnameofLV*/ 
short res1; P reserved area*/ 
long maxsize P maximum number of partitions allowed 7 
char Ivjtate; P state of logical volume*/ 
char mirror; Pnone,single, or double V 
short mirrorj)olicy;P type of writing used to write*/ 
long numjps; P number of logical partitions on the Iv7 
Pbasel*/ 

char permissions; P read write or read only*/ 
char bbjelocation; P specifies if bad block */ 

P relocation is desired */ 
char write_verify; P verify all writes to the LV7 4 
char mirwrt_consist; P minor write consistency flag */ 
long res3; P reserved area on disk */ 
double res4; Preserved area on disk*/ 

}; 



struct pv_header
{
    struct unique_id pv_id;     /* unique identifier of PV */
    unsigned short pp_count;    /* number of physical partitions */
                                /* on PV */
    char pv_state;              /* state of physical volume */
    char res1;                  /* reserved area on disk */
    daddr_t psn_part1;          /* physical sector number of 1st pp */
    short pvnum_vgdas;          /* number of vg descriptor areas */
                                /* on the physical volume */
    short pv_num;               /* PV number */
    long res2;                  /* reserved area on disk */
};

struct pp_entries 

{

FIG. 21-12

    short lv_index;             /* index to lv pp is on */
    short res_1;                /* reserved area on disk */
    long lp_num;                /* log. part number */
    char copy;                  /* the copy of the logical partition */
                                /* that this pp is allocated for */
    char pp_state;              /* current state of pp */
    char fst_alt_vol;           /* pv where partition allocation for */
                                /* first mirror begins */
    char snd_alt_vol;           /* pv where partition allocation for */
                                /* second mirror begins */
    short fst_alt_part;         /* partition to begin first mirror */
    short snd_alt_part;         /* partition to begin second mirror */
    double res_3;               /* reserved area on disk */
    double res_4;               /* reserved area on disk */
};



struct namelist
{
    char name[LVM_MAXLVS][LVM_NAMESIZ];
};

struct vg_trailer
{
    struct timestruc_t timestamp;   /* time of last update */
    double res_1;               /* reserved area on disk */
    double res_2;               /* reserved area on disk */
    double res_3;               /* reserved area on disk */
};



/*
 * The following structures are used in lvm_varyonvg
 */

struct da_sa_info

FIG. 21-13

/* structure to contain timestamp information about a
   volume group descriptor or status area */
{
    struct timestruc_t ts_beg;
    /* beginning timestamp value */
    struct timestruc_t ts_end;
    /* ending timestamp value */
    short int ts_status;
    /* indicates if read error on either timestamp, or if both good
       indicates if beginning ts equal or greater than ending ts */
    short int wrt_order;
    /* indicates order in which to write this VGDA or VGSA copy */
    short int wrt_status;
    /* indicates whether this VGDA or VGSA copy is to be written */
};
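
The ts_status field records how the beginning and ending timestamps of one VGDA or VGSA copy relate to each other. A minimal sketch of such a classification follows. The status names, the `sketch_stamp` type standing in for struct timestruc_t, and the treatment of a beginning timestamp older than the ending one are all assumptions made for illustration; the listing only says that the field distinguishes a read error, equal timestamps, and a beginning timestamp greater than the ending one.

```c
/* Hypothetical stand-in for struct timestruc_t. */
struct sketch_stamp {
    long tv_sec;
    long tv_nsec;
};

/* Hypothetical status values; the listing does not name them. */
enum sketch_ts_status {
    SKETCH_TS_READ_ERROR,   /* one or both timestamps could not be read */
    SKETCH_TS_EQUAL,        /* beginning == ending: copy looks complete */
    SKETCH_TS_BEG_NEWER,    /* beginning > ending: write may be incomplete */
    SKETCH_TS_UNEXPECTED    /* beginning < ending: not described in the listing */
};

/* Classify one copy from its two timestamps and their read results. */
enum sketch_ts_status sketch_classify(int beg_read_ok, int end_read_ok,
                                      struct sketch_stamp beg,
                                      struct sketch_stamp end)
{
    if (!beg_read_ok || !end_read_ok)
        return SKETCH_TS_READ_ERROR;
    if (beg.tv_sec == end.tv_sec && beg.tv_nsec == end.tv_nsec)
        return SKETCH_TS_EQUAL;
    if (beg.tv_sec > end.tv_sec ||
        (beg.tv_sec == end.tv_sec && beg.tv_nsec > end.tv_nsec))
        return SKETCH_TS_BEG_NEWER;
    return SKETCH_TS_UNEXPECTED;
}
```

The sketch assumes the beginning timestamp is written before the data and the ending timestamp after it, so a copy whose write was interrupted would show a newer beginning timestamp.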

struct inpvs_info
/* information structure for PVs in the user's input list */
{
    struct
    {
        int fd;
        /* file descriptor for open of physical disk */
        struct unique_id pv_id;
        /* the unique id for this physical volume */
        dev_t device;
        /* the major/minor number of the physical volume */
        daddr_t da_psn[LVM_PVMAXVGDAS];
        /* physical sector number (PSN) of beginning of the
           volume group descriptor area (primary and secondary
           copies), or 0 if none */
        daddr_t sa_psn[LVM_PVMAXVGDAS];
        /* PSN of beginning of the volume group status area
           (primary and secondary copies), or 0 if none */
        daddr_t reloc_psn;
        /* PSN of the beginning of the bad block relocation
           pool */

FIG. 21-14

        long reloc_len;
        /* the length in blocks of the bad block relocation
           pool */
        short int pv_num;
        /* the number of the physical volume */
        short int pv_status;
        /* status of the physical volume */
#define LVM_NOTVLDPV 0  /* non valid physical volume */
#define LVM_VALIDPV  1  /* valid physical volume */
        struct da_sa_info da[LVM_PVMAXVGDAS];
        /* array of structures to contain timestamp information
           about VGDAs on one PV */
        short int index_newestda;
        /* index of VGDA copy on PV which has newest timestamp */
        short int index_nextda;
        /* index of VGDA copy on PV which is next written */
    } pv[LVM_MAXPVS];
    /* array of physical volumes, indexed by order in input
       parameter list */
    long lvmarea_len;
    /* the length of the entire LVM reserved area on disk */
    long vgda_len;
    /* length of the volume group descriptor area */
    long vgsa_len;
    /* length of the volume group status area */
    short int num_desclps;
    /* the number of logical partitions per PV needed for the
       descriptor / status area logical volume */
    short int pp_size;
    /* the size of a physical partition for this volume group */
};

struct mwc_info
/* structure to contain timestamp information about a mirror
   write cache area */
{
    struct timestruc_t ts;
    /* timestamp value */
    short int good_mwcc;
    /* flag which indicates if the MWCC could not be read */
    short int wrt_status;
    /* indicates whether this MWCC is to be written */
};

struct defpvs_info
/* information structure for PVs defined into the kernel */
{
    struct
    {
        short int in_index;
        /* corresponding index into the input PV information
           structure for this PV */
        short int pv_status;
        /* indicates if this PV is defined into the kernel */
#define LVM_NOTDFND 0   /* this PV not defined in kernel */
#define LVM_DEFINED 1   /* this PV defined in kernel */
        struct da_sa_info sa[LVM_PVMAXVGDAS];
        /* array of structures to contain timestamp information
           about VGSAs on one PV */
        struct mwc_info mwc;
        /* structure to contain information about the mirror
           write consistency cache on this PV */
    } pv[LVM_MAXPVS];
    /* array of physical volumes indexed by PV number */
    int total_vgdas;
    /* total number of volume group descriptor/status areas */
    struct timestruc_t newest_dats;
    /* newest good timestamp for the volume group descriptor area */
    struct timestruc_t newest_sats;
    /* newest good timestamp for the volume group status area */
    struct timestruc_t newest_mwcts;
    /* timestamp for newest mirror write consistency cache */

};

/*
 * Function declarations
 */

#ifndef _NO_PROTO

/*
 * bbdirutl.c
 */

int lvm_bbdsane(
    char *buf);
    /* buffer containing the directory to check */

int lvm_getbbdir(
    int pv_fd,
    /* the file descriptor for this physical volume device */
    char *buf,
    /* a buffer into which the bad block directory will be read */
    int dir_flg);
    /* flags to indicate which directory to read */

int lvm_rdbbdir(
    int pv_fd,
    /* the file descriptor for this physical volume device */
    char *buf,
    /* a buffer into which the bad block directory will be read */
    int act_flg);
    /* flag to indicate type of action requested */

int lvm_wrbbdir(
    int pv_fd,
    /* the file descriptor for this physical volume device */
    char *bbdir_buf,
    /* a buffer containing the bad block directory */
    int dir_flg);
    /* flags to indicate which directory to write */



/*
 * bblstutl.c
 */

void lvm_addbb(
    struct bad_blk **head_ptr,
    /* a pointer to the pointer to the head of the bad block linked list */
    struct bad_blk *bb_ptr);
    /* pointer to the bad block structure which is to be added to the
       list */

int lvm_bldbblst(
    int pv_fd,
    /* the file descriptor for this physical volume device */
    struct pvol *pvol_ptr,
    /* a pointer to a structure which describes a physical volume for the
       logical volume device driver (LVDD) */
    daddr_t reloc_psn);
    /* the physical sector number of the beginning of the bad block
       relocation pool */

void lvm_chgbb(
    struct bad_blk *head_ptr,
    /* a pointer to the head of the bad block linked list */
    daddr_t bad_blk,
    /* the bad block whose data is to be changed */
    daddr_t reloc_blk,
    /* the new value for the relocation block, if it is to be changed */
    int chgtype);
    /* type of change requested (change relocation block or status field)
       for this bad block */

/*
 * chkquorum.c
 */

int lvm_chkquorum(
    struct varyonvg *varyonvg,
    /* pointer to the structure which contains input parameter data for
       the lvm_varyonvg routine */
    int vg_fd,
    /* the file descriptor for the volume group reserved area logical
       volume */
    struct inpvs_info *inpvs_info,
    /* structure which contains information about the input list of PVs
       for the volume group */
    struct defpvs_info *defpvs_info,
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */
    caddr_t vgda_ptr,
    /* pointer to the volume group descriptor area */
    struct vgsa_area **vgsa_ptr,
    /* pointer to the volume group status area */
    daddr_t vgsa_lsn[LVM_MAXPVS][LVM_PVMAXVGDAS],
    /* array in which to store the logical sector number addresses of the
       VGSAs for each PV */
    char mwcc[DBSIZE]);
    /* buffer in which the latest mirror write consistency cache will be
       returned */

int lvm_vgsamwcc(
    int vg_fd,
    /* file descriptor for the VG reserved area logical volume which
       contains the volume group descriptor area and status area */
    struct inpvs_info *inpvs_info,
    /* structure which contains information about the input list of PVs for
       the volume group */
    struct defpvs_info *defpvs_info,
    /* pointer to structure which contains information about PVs defined
       into the kernel */
    caddr_t vgda_ptr,
    /* pointer to the volume group descriptor area */
    long quorum,
    /* number of VGDAs/VGSAs needed to varyon in order to ensure that the
       volume group data is consistent with that from previous varyon */
    struct vgsa_area **vgsa_ptr,
    /* variable to contain the pointer to the buffer which will contain
       the volume group status area */
    daddr_t vgsa_lsn[LVM_MAXPVS][LVM_PVMAXVGDAS],
    /* array in which to store the logical sector number addresses of the
       VGSAs for each PV */
    char mwcc[DBSIZE]);
    /* buffer in which the latest mirror write consistency cache will be
       returned */



/*
 * comutl.c
 */

int lvm_chkvaryon(
    struct unique_id *vg_id);
    /* the id of the volume group */

void lvm_mapoff(
    struct fheader *mapfilehdr,
    /* a pointer to the mapped file header which contains the offsets of
       the different data areas within the mapped file */
    caddr_t vgda_ptr);
    /* a pointer to the beginning of the volume group descriptor area */

int lvm_openmap(
    struct unique_id *vg_id,
    /* a pointer to the volume group id */
    int mapf_mode,
    /* the access mode with which to open the mapped file */
    int *vgmap_fd,
    /* pointer to the variable in which to return the file descriptor of the
       mapped file */
    caddr_t *vgmap_ptr);
    /* pointer to the variable in which to return the pointer to the
       beginning of the mapped file */

int lvm_relocmwcc(
    int pv_fd,
    /* file descriptor of physical volume where block containing the mirror
       write consistency cache needs to be relocated */
    char mwcc[DBSIZE]);
    /* buffer which contains data to be written to mirror write consistency
       cache */

int lvm_rdiplrec(
    int pv_fd,
    /* the file descriptor for the physical volume device */
    IPL_REC_PTR ipl_rec);
    /* a pointer to the buffer into which the IPL record will be read */

int lvm_tscomp(
    struct timestruc_t *ts1,
    /* first timestamp value */
    struct timestruc_t *ts2);
    /* second timestamp value */
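
The listing declares a timestamp comparison routine but does not show its body or its return convention. The following sketch shows one way two timestamps could be compared and the newest readable copy of a descriptor area selected. The `sketch_` names, the `sketch_ts` type standing in for struct timestruc_t, and the negative/zero/positive return convention are assumptions for illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct timestruc_t. */
struct sketch_ts {
    long tv_sec;
    long tv_nsec;
};

/* Assumed convention: negative if a is older than b, zero if the two
 * are equal, positive if a is newer than b. */
int sketch_tscomp(const struct sketch_ts *a, const struct sketch_ts *b)
{
    if (a->tv_sec != b->tv_sec)
        return (a->tv_sec < b->tv_sec) ? -1 : 1;
    if (a->tv_nsec != b->tv_nsec)
        return (a->tv_nsec < b->tv_nsec) ? -1 : 1;
    return 0;
}

/* Return the index of the newest copy among those that were read
 * successfully, or -1 if no copy is readable. */
int sketch_newest_copy(const struct sketch_ts *ts, const int *readable,
                       size_t n)
{
    int newest = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!readable[i])
            continue;   /* skip copies with a read error */
        if (newest < 0 || sketch_tscomp(&ts[i], &ts[newest]) > 0)
            newest = (int)i;
    }
    return newest;
}
```

A result of this kind is what a field such as index_newestda would hold for the copies on one physical volume.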

int lvm_updtime(
    struct timestruc_t *beg_time,
    /* a pointer to the beginning timestamp to be updated */
    struct timestruc_t *end_time);
    /* a pointer to the ending timestamp to be updated */

/*
 * crtinsutl.c
 */

int lvm_initbbdir(
    int pv_fd,
    /* the file descriptor for the physical volume device */
    daddr_t reloc_psn);
    /* the physical sector number of the beginning of the bad block
       relocation pool */

void lvm_initlvmrec(
    struct lvm_rec *lvm_rec,
    /* pointer to the LVM information record */
    short int vgda_size,
    /* the length of the volume group descriptor area in blocks */
    short int ppsize,
    /* physical partition size represented as a power of 2 */
    long data_capacity);
    /* the data capacity of the disk in number of blocks */

int lvm_instsetup(
    struct unique_id *vg_id,
    /* pointer to id of the volume group into which the PV is to be
       installed */
    char *pv_name,
    /* a pointer to the name of the physical volume to be added to the
       volume group */
    short int override,
    /* flag for which a true value indicates to override a VG member error,
       if it occurs, and install the physical volume into the indicated
       volume group */
    struct unique_id *cur_vg_id,
    /* structure in which to return the volume group id, if this PV's
       LVM record indicates it is already a member of a volume group */
    int *pv_fd,
    /* a pointer to where the file descriptor for the physical volume
       device will be stored */
    IPL_REC_PTR ipl_rec,
    /* a pointer to the block into which the IPL record will be read */
    struct lvm_rec *lvm_rec,
    /* a pointer to the block into which the LVM information record will
       be read */
    long *data_capacity);
    /* the data capacity of the disk in number of sectors */

void lvm_pventry(
    struct unique_id *pv_id,
    /* pointer to a structure which contains id for the physical volume for
       which the entry is to be created */
    struct vg_header *vghdr_ptr,
    /* a pointer to the volume group header of the descriptor area */
    struct pv_header **pvptr,
    /* a pointer to the beginning of the list of physical volume entries
       in the descriptor area */
    long num_parts,
    /* the number of partitions available on this physical volume */
    daddr_t beg_psn,
    /* the physical sector number of the first physical partition on this
       physical volume */
    short int num_vgdas);
    /* the number of volume group descriptor areas which are to be placed
       on this physical volume */

int lvm_vgdas3to3(
    int lv_fd,
    /* the file descriptor of the LVM reserved area logical volume */
    caddr_t vgmap_ptr,
    /* pointer to the beginning of the mapped file */
    short int new_pv,
    /* the PV number of the new physical volume which is being added */
    short int save_pv_2,
    /* the PV number of the physical volume which previously had two copies
       of the VGDA */
    short int save_pv_1);
    /* the PV number of the physical volume which previously had one copy
       of the VGDA */

int lvm_vgmem(
    struct unique_id *pv_id,
    /* pointer to id of the physical volume for which we are to determine
       membership in the specified VG */
    caddr_t vgda_ptr);
    /* pointer to the beginning of the volume group descriptor area */

int lvm_zeromwc(
    int pv_fd,
    /* the file descriptor of the physical volume */
    short int newvg);
    /* flag to indicate if this is newly created volume group */

int lvm_zerosa(
    int lv_fd,
    /* the file descriptor for the LVM reserved area logical volume */
    daddr_t sa_lsn[LVM_PVMAXVGDAS]);
    /* the logical sector numbers within the LVM reserved area logical
       volume of where to initialize the copies of the volume group status
       area */



/*
 * configutl.c
 */

int lvm_addmpv(
    struct unique_id *vg_id,
    /* the volume group id of the volume group which is to be added into
       the kernel */
    long vg_major,
    /* the major number where the volume group is to be added */
    short int pv_num);
    /* number of the PV to be deleted from the volume group */

int lvm_addpv(
    long partlen_blks,
    /* the length of a partition in number of 512 byte blocks */
    short int num_desclps,
    /* the number of partitions needed on each physical volume to contain
       the LVM reserved area */
    dev_t device,
    /* the major / minor number of the device */
    int pv_fd,
    /* the file descriptor of the physical volume device */
    short int pvnum,
    /* the index number for this physical volume */
    long vg_major,
    /* the major number of the volume group */
    struct unique_id *vg_id,
    /* the volume group id of the volume group to which the physical
       volume is to be added */
    daddr_t reloc_psn,
    /* the physical sector number of the beginning of the bad block
       relocation pool */
    long reloc_len,
    /* the length of the bad block relocation pool */
    daddr_t psn_part1,
    /* the physical sector number of the first partition on the physical
       volume */
    daddr_t vgsa_lsn[LVM_PVMAXVGDAS],
    short int quorum_cnt);
    /* the number of VGDAs/VGSAs needed for a quorum */

int lvm_chgqrm(
    struct unique_id *vg_id,
    /* the volume group id of the volume group */
    long vg_major,
    /* the major number of the volume group */
    short int quorum_cnt);
    /* number of VGDA/VGSA copies needed for a quorum */
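
The quorum count passed to these routines is the number of VGDA/VGSA copies that must be available for the volume group to be varied on. The listing does not say how that number is derived. The sketch below assumes a simple majority rule purely for illustration; both function names are hypothetical.

```c
/* Assumption for illustration: a quorum is a strict majority of all
 * VGDA/VGSA copies defined for the volume group. */
short sketch_majority_quorum(short total_vgdas)
{
    return (short)(total_vgdas / 2 + 1);
}

/* A varyon may proceed only if at least quorum_cnt copies are good. */
int sketch_can_varyon(short good_copies, short quorum_cnt)
{
    return good_copies >= quorum_cnt;
}
```

Under this assumed rule, a volume group with three copies needs two of them, so losing any single copy still leaves a quorum.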
int lvm_chgvgsa(
    struct unique_id *vg_id,
    /* the volume group id */
    long vg_major,
    /* the major number of the volume group */
    daddr_t vgsa_lsn[LVM_PVMAXVGDAS],
    /* array of logical sector number addresses of the VGSA copies on this
       PV */
    short int pvnum,
    /* number of the PV which is to have changes to the number of VGSAs */
    short int quorum_cnt,
    /* number of VGDAs/VGSAs needed for a quorum */
    int command);
    /* command value which indicates the config routine to be called is
       that for adding/deleting VGSAs */

int lvm_chkvgstat(
    struct varyonvg *varyonvg,
    /* pointer to the structure which contains input information for
       varyonvg */
    int *vgstatus);
    /* pointer to variable to contain the varied on status of the volume
       group */

int lvm_config(
    mid_t kmid,
    /* the module id for the object module which contains the logical
       volume device driver */
    long vg_major,
    /* the major number of the volume group */
    int request,
    /* the request for the configuration routine to be called within the
       kernel hd_config routine */
    struct ddi_info *cfgdata);
    /* structure to contain the input parameters for the configuration
       device driver */

int lvm_defvg(
    long partlen_blks,
    /* the length of a partition in number of 512 byte blocks */
    short int num_desclps,
    /* the number of partitions needed on each physical volume to contain
       the LVM reserved area */
    mid_t kmid,
    /* the module id which identifies where the LVDD code is loaded */
    long vg_major,
    /* the major number where the volume group is to be added */
    struct unique_id *vg_id,
    /* the volume group id of the volume group which is to be added into
       the kernel */
    short int ppsize,
    /* the physical partition size, represented as a power of 2 of the
       size in bytes, for partitions in this volume group */
    long noopen_lvs);
    /* flag to indicate if logical volumes in the volume group are not
       allowed to be opened */

int lvm_delpv(
    struct unique_id *vg_id,
    /* the volume group id of the volume group which is to be added into
       the kernel */
    long vg_major,
    /* the major number where the volume group is to be added */
    short int pvnum,
    /* number of the PV to be deleted from the volume group */
    short int num_desclps,
    /* number of logical partitions in the descriptor / status area
       logical volume for this PV */
    int flag,
    /* flag to indicate whether the PV is being deleted from the volume
       group or just temporarily removed */
    short int quorum_cnt);
    /* quorum count of logical volume */

void lvm_delvg(
    struct unique_id *vg_id,
    /* the volume group id of the volume group which is to be added into
       the kernel */
    long vg_major);
    /* the major number where the volume group is to be added */



/*
 * lvmrecutl.c
 */

void lvm_cmplvmrec(
    struct unique_id *vgid,     /* pointer to volume group id */
    char *match,                /* indicates a matching vgid */
    char pvname[LVM_NAMESIZ]);  /* name of pv to read lvm rec from */

int lvm_rdlvmrec(
    int pv_fd,
    /* the file descriptor for the physical volume device */
    struct lvm_rec *lvm_rec);
    /* a pointer to the buffer into which the LVM information record will
       be read */

int lvm_wrlvmrec(
    int pv_fd,
    /* the file descriptor for the physical volume device */
    struct lvm_rec *lvm_rec);
    /* a pointer to the buffer which contains the LVM information record
       to be written */

void lvm_zerolvm(
    int pv_fd);
    /* the file descriptor for the physical volume device */



/*
 * queryutl.c
 */

extern int lvm_chklvclos(
    struct lv_id *lv_id,
    /* logical volume id */
    long major_num);
    /* major number of volume group */

extern int lvm_getpvda(
    char *pv_name,
    /* a pointer to the name of the physical volume to be added to the
       volume group */
    char **map_ptr,
    /* a pointer to where the pointer to the memory area containing the
       mapped file information will be stored */
    int rebuild);
    /* indicates we are rebuilding the vg file */

extern int lvm_gettsinfo(
    int pvfd,                   /* file descriptor for physical volume */
    daddr_t psn[LVM_PVMAXVGDAS],
    /* array of physical sector numbers for VGDAs */
    long vgdalen,
    /* length of volume group descriptor area */
    int *copy,                  /* copy of VGDA with newest timestamp */
    int rebuild);
    /* indicates we are rebuilding the vg file */



/*
 * rdex_com.c
 */

extern int rdex_proc(
    struct lv_id *lv_id,        /* logical volume id */
    struct ext_redlv *ext_red,  /* maps of pps to be extended or reduced */
    char *vgfptr,               /* pointer to volume group mapped file */
    int vgfd,                   /* volume group file descriptor */
    short minor_num,            /* minor number of logical volume */
    int indicator);             /* indicator for extend or reduce operation */

/*
 * revaryon.c
 */

int lvm_revaryon(
    struct varyonvg *varyonvg,
    /* pointer to a structure which contains the input information for
       the lvm_varyonvg subroutine */
    int vgmap_fd,
    /* the file descriptor for the mapped file */
    struct inpvs_info *inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
    struct defpvs_info *defpvs_info);
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */

int lvm_vonmisspv(
    struct varyonvg *varyonvg,
    /* pointer to a structure which contains the input information for
       the lvm_varyonvg subroutine */
    struct inpvs_info *inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
    struct defpvs_info *defpvs_info,
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */
    struct fheader *maphdr_ptr,
    /* pointer to the mapped file header */
    caddr_t vgda_ptr,
    /* pointer to the beginning of the volume group descriptor area */
    struct pv_header *pv_ptr,
    /* a pointer to the header of a physical volume entry in the volume
       group descriptor area */
    int vg_fd,
    /* the file descriptor for the volume group reserved area logical
       volume */
    short int in_index,
    /* index into the input list of a physical volume */
    int *chkvvgout);
    /* flag to indicate if the varyonvg output structure should be
       checked */



/*
 * setupvg.c
 */

int lvm_setupvg(
    struct varyonvg *varyonvg,
    /* pointer to the structure which contains input information for
       varyonvg */
    struct inpvs_info *inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
    struct defpvs_info *defpvs_info,
    /* pointer to the structure which contains information about the physical
       volumes defined into the kernel */
    struct fheader *maphdr_ptr,
    /* a pointer to the file header portion of the mapped file */
    int vg_fd,
    /* file descriptor of the volume group reserved area logical volume */
    caddr_t vgda_ptr,
    /* a pointer to the in-memory copy of the volume group descriptor
       area */
    struct vgsa_area *vgsa_ptr,
    /* a pointer to the volume group status area */
    daddr_t vgsa_lsn[LVM_MAXPVS][LVM_PVMAXVGDAS],
    /* array of logical sector number addresses of all VGSA copies */
    struct mwc_rec *mwcc);
    /* buffer which contains latest mirror write consistency cache */

int lvm_bldklvlp(
    caddr_t vgda_ptr,
    /* a pointer to the volume group descriptor area */
    struct vgsa_area *vgsa_ptr,
    /* a pointer to the volume group status area */
    struct lvol *lvol_ptrs[LVM_MAXLVS]);
    /* array of pointers to the LVDD logical volume structures */

int lvm_mwcinfo(
    struct varyonvg *varyonvg,
    /* pointer to the structure which contains input information for
       varyonvg */
    struct inpvs_info *inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
    struct defpvs_info *defpvs_info,
    /* pointer to structure which contains information about the physical
       volumes defined into the kernel */
    struct fheader *maphdr_ptr,
    /* a pointer to the file header portion of the mapped file */
    int vg_fd,
    /* file descriptor of the volume group reserved area logical volume */
    caddr_t vgda_ptr,
    /* a pointer to the volume group descriptor area */
    struct vgsa_area *vgsa_ptr,
    /* a pointer to the volume group status area */
    daddr_t vgsa_lsn[LVM_MAXPVS][LVM_PVMAXVGDAS],
    /* array of logical sector number addresses of all VGSA copies */
    struct lvol *lvol_ptrs[LVM_MAXLVS],
    /* array of pointers to LVDD logical structures */
    struct mwc_rec *mwcc,
    /* buffer which contains latest mirror write consistency cache */
    struct mwc_rec *kmwcc,
    /* buffer to contain list of logical track groups from the MWCC which
       need to be resynced in the kernel */
    short int *num_entries);
    /* number of logical track group entries in the kernel MWCC buffer */



/*
 * synclp.c
 */

extern int synclp(
    int lvfd,                   /* logical volume file descriptor */
    struct lv_entries *lv,      /* pointer to logical volume entry */
    struct unique_id *vg_id,    /* volume group id */
    char *vgptr,                /* pointer to the volume group mapped file */

int " " ' J -— 

short 
long 

    int force);                 /* resync any non-stale lp if TRUE */



/*
 * utilities.c
 */

extern int get_lvinfo(
    struct lv_id *lv_id,        /* logical volume id */
    struct unique_id *vg_id,    /* volume group id */
    short *minornum,            /* logical volume minor number */
    int *vgfd,                  /* volume group file descriptor */
    char **vgptr,               /* pointer to volume group mapped file */
    int mode);                  /* how to open the vg mapped file */

extern int get_ptrs(
    char *vgmptr,               /* pointer to the beginning of the volume */
                                /* group mapped file */
    struct fheader **header,    /* points to the file header */
    struct vg_header **vgptr,   /* points to the volume group header */
    struct lv_entries **lvptr,  /* points to the logical volume entries */
    struct pv_header **pvptr,   /* points to the physical volume header */
    struct pp_entries **pp_ptr, /* points to the physical partition entries */
    struct namelist **nameptr); /* points to the name descriptor area */

extern int lvm_errors(
    char failing_rtn[LVM_NAMESIZ],  /* name of routine with error */
    char calling_rtn[LVM_NAMESIZ],  /* name of calling routine */
    int rc);                        /* error returned from failing rtn */

extern int get_pvandpp(
    struct pv_header **pv,      /* pointer to the physical volume header */
    struct pp_entries **pp,     /* pointer to the physical partition entry */
    short *pvnum,               /* pv number of physical volume id sent in */
    char *vgfptr,               /* pointer to the volume group mapped file */
    struct unique_id *id);      /* id of pv you need a pointer to */

extern int bldlvinfo(
    struct logview **lp,        /* pointer to logical view of a logical vol */
    char *vgfptr,               /* pointer to volume group mapped file */
    struct lv_entries *lv,      /* pointer to a logical volume */
    long *cnt,                  /* number of pps per copy of logical volume */
    short minor_num,            /* minor number of logical volume */
    int flag);                  /* GETSTALE if info on stale pps is desired */
                                /* NOSTALE if not */

extern int status_chk(
    char *vgptr,                /* pointer to volume group mapped file */
    char *name,                 /* name of device to be checked */
    int flag,                   /* indicator to check the major number */
    char *rawname);             /* pointer to new raw device name */

extern int lvm_special_3to2(
    char *pvname,
    /* name of physical volume being removed */
    struct unique_id *vgid,
    /* pointer to volume group id */
    int lvfd,
    /* reserved area logical volume file desc */
    char *vgdaptr,
    /* pointer to vgda area of the vg file */
    short pvnum0,               /* number of pv to delete/remove */
    short pvnum1,               /* number of pv to keep one copy */
    short pvnum2,               /* number of pv to keep two copies */
    char *match,
    /* indicates the vgid in the lvm record */
    /* matches the one passed in */
    char delete,                /* indicates we are called by deletepv */
    struct fheader *fhead);     /* pointer to vg mapped file header */

extern int getstates(
    struct vgsa_area *vgsa,     /* pointer to buffer for volume group status */
                                /* area */
    char *vgfptr);              /* pointer to volume group mapped file */

extern int timestamp(
    struct vg_header *vg,       /* pointer to volume group header */
    char *vgptr,                /* pointer to volume group file */
    struct fheader *fhead);     /* pointer to vg file header */

extern int timestamp(
    struct vg_header *vg,       /* pointer to volume group header */
    char *vgptr,                /* pointer to volume group file */
    struct fheader *fhead);     /* pointer to file header of vg file */

extern int buildname(
    dev_t dev,                  /* device info for physical volume */
    char name[LVM_EXTNAME],     /* array to store name we create for pv */
    int mode,                   /* mode to set the device entry to */
    int type);                  /* type of name to build */

extern int rebuild_file(
    struct unique_id *vgid,     /* pointer to volume group id */
    int *vgfd);                 /* vg file descriptor */

extern void calc_lsn(
    struct fheader *fhead,      /* pointer to volume group file header */
    struct rebuild *rebuild);   /* pointer to info from the rebuilding */
                                /* of the volume group file */

/*
 * varyonvg.c
 */

int lvm_forceqrm(
    caddr_t vgda_ptr,
    /* pointer to the volume group descriptor area */
    struct defpvs_info *defpvs_info);
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */

void lvm_mapfile(
    struct varyonvg *varyonvg,
    /* pointer to a structure which contains the input information for
       the lvm_varyonvg subroutine */
    struct inpvs_info *inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
    struct defpvs_info *defpvs_info,
    /* pointer to structure which contains information about PVs which are
       defined into the kernel */
    struct fheader *mapfilehdr,
    /* a pointer to the mapped file header which contains the offsets of
       the different data areas within the mapped file */
    caddr_t vgda_ptr);
    /* pointer to the volume group descriptor area */

void lvm_pvstatus(
    struct varyonvg *varyonvg,
    /* pointer to the structure which contains input parameter data for
       the lvm_varyonvg routine */
    struct defpvs_info *defpvs_info,
    /* pointer to structure which contains information about PVs which are
       defined into the kernel */
    caddr_t vgda_ptr,
    /* pointer to the volume group descriptor area */
    int *missname,
    /* flag to indicate if there are any PV names missing from the input
       list */
    int *misspv);
    /* flag to indicate if there are any PVs missing from the varied-on
       volume group (i.e., PVs that could not be defined into the kernel) */

int lvm_update(
    struct varyonvg *varyonvg,
    /* pointer to the structure which contains input information for
       varyonvg */
    struct inpvs_info *inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
    struct defpvs_info *defpvs_info,
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */
    struct fheader *maphdr_ptr,
    /* a pointer to the file header portion of the mapped file */
    int vg_fd,
    /* the file descriptor for the volume group reserved area logical
       volume */
    caddr_t vgda_ptr,
    /* pointer to the volume group descriptor area */
    struct vgsa_area *vgsa_ptr,
    /* pointer to the volume group status area */
    daddr_t vgsa_lsn[LVM_MAXPVS][LVM_PVMAXVGDAS],
    /* array which contains the logical sector number addresses of all
       the VGSAs */
    char mwcc[DBSIZE],
    /* buffer containing the latest mirror write consistency cache */
    int forceqrm,
    /* flag to indicate if the quorum has been forced */
    int misspv);
    /* flag to indicate if there were any missing PVs */



/*
 * verify.c
 */

int lvm_verify (
struct varyonvg * varyonvg,
    /* pointer to the structure which contains input parameter data for
       the lvm_varyonvg routine */
int * vg_fd,
    /* pointer to the variable to contain the file descriptor for the
       volume group reserved area logical volume */
struct inpvs_info * inpvs_info,
    /* structure which contains information about the input list of PVs
       for the volume group */
struct defpvs_info * defpvs_info,
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */
caddr_t * vgda_ptr,
    /* pointer to variable where the pointer to the volume group descriptor
       area is to be returned */
struct vgsa_area ** vgsa_ptr,
    /* variable to contain the pointer to the buffer which will contain the
       volume group status area */
daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
    /* array in which to store the logical sector number addresses of the
       VGSAs for each PV */
char mwcc [DBSIZE]);
    /* buffer in which the latest mirror write consistency cache will be
       returned */

int lvm_defpvs (
struct varyonvg * varyonvg,
    /* pointer to the structure which contains input parameter data for
       the lvm_varyonvg routine */
int vg_fd,
    /* file descriptor of the volume group reserved area logical volume */
struct inpvs_info * inpvs_info,
    /* structure which contains information about the input list of PVs for
       the volume group */
struct defpvs_info * defpvs_info);
    /* structure which contains information about the volume group
       descriptor and status areas for PVs defined into the kernel */

void lvm_getdainfo (
int vg_fd,
    /* file descriptor for the VG reserved area logical volume which
       contains the volume group descriptor area and status area */
short int pv_index,
    /* index variable for looping on physical volumes in input list */
struct inpvs_info * inpvs_info,
    /* structure which contains information about the input list of PVs for
       the volume group */
struct defpvs_info * defpvs_info);
    /* pointer to structure which contains information about PVs defined
       into the kernel */

int lvm_readpvs (
struct varyonvg * varyonvg,
    /* pointer to the structure which contains input parameter data for
       the lvm_varyonvg routine */
struct inpvs_info * inpvs_info);
    /* structure which contains information about the input list of PVs
       for the volume group */

int lvm_readvgda (
int vg_fd,
    /* the file descriptor for the volume group reserved area logical
       volume */
short int override,
    /* flag which indicates if no quorum error is to be overridden */
struct inpvs_info * inpvs_info,
    /* structure which contains information about the input list of PVs
       for the volume group */
struct defpvs_info * defpvs_info,
    /* structure which contains information about volume group descriptor
       areas and status areas for the defined PVs in the volume group */
caddr_t * vgda_ptr);
    /* pointer to buffer in which to read the volume group descriptor
       area */

/*
 * vonutl.c
 */

void lvm_clsinpvs (
struct varyonvg * varyonvg,
    /* pointer to the structure which contains input parameter data for
       the lvm_varyonvg routine */
struct inpvs_info * inpvs_info);
    /* structure which contains information about the input list of PVs
       for the volume group */

int lvm_deladdm (
struct varyonvg * varyonvg,
    /* pointer to the structure which contains input information for
       varyonvg */
struct inpvs_info * inpvs_info,
    /* a pointer to the structure which contains information about PVs
       from the input list */
struct defpvs_info * defpvs_info,
    /* pointer to structure which contains information about the physical
       volumes defined into the kernel */
short int pv_num);
    /* the PV number of PV being changed to a missing PV */

int lvm_vonresync (
struct unique_id * vg_id);
    /* pointer to the volume group id */



/*
 * wrtutl.c
 */

int lvm_diskio (
caddr_t vg_mapptr,
    /* a pointer to the mapped file for this volume group */
int vg_mapfd);
    /* the file descriptor of the mapped file for this volume group */

int lvm_updvgda (
int lv_fd,
    /* the file descriptor of the descriptor area logical volume, if it
       is already open */
struct fheader * maphdr_ptr,
    /* a pointer to the file header portion of the mapped file */
caddr_t vgda_ptr);
    /* a pointer to the memory location which holds the volume group
       descriptor area */

int lvm_wrtdasa (
int vg_fd,
    /* the file descriptor of the LVM reserved area logical volume */
caddr_t area_ptr,
    /* a pointer to the memory location which holds the volume group
       descriptor or status area */
struct timestruc_t * e_timestamp,
    /* a pointer to the end timestamp for the area */
long area_len,
    /* the length in sectors of the area */
daddr_t lsn,
    /* the logical sector number within the LVM reserved area logical
       volume of where to write a copy of the area */
short int write_order);
    /* flag which indicates whether the area is to be written in the order
       of beginning/middle/end or middle/beginning/end */

int lvm_wrtmapf (
int vgmap_fd,
    /* the file descriptor of the mapped file */
caddr_t vgmap_ptr);
    /* the pointer to the beginning of the mapped file */

int lvm_wrtnext (
int lv_fd,
    /* the file descriptor of the LVM reserved area logical volume */
caddr_t vgda_ptr,
    /* a pointer to the memory location which holds the volume group
       descriptor area */
struct timestruc_t * etimestamp,
    /* pointer to the ending timestamp in the vg trailer */
short int pvnum,
    /* the PV number of the PV to which the VGDA is to be written */
struct fheader * maphdr_ptr,
    /* pointer to the mapped file header */
short int pvnum_vgdas);
    /* number of volume group descriptor areas to be written to the PV */



#else 

/*
 * bbdirutl.c
 */

int lvm_bbdsane ( );
int lvm_getbbdir ( );
int lvm_rdbbdir ( );
int lvm_wrbbdir ( );

/*
 * bblstutl.c
 */

void lvm_addbb ( );
int lvm_bldbblst ( );
void lvm_chgbb ( );

/*
 * chkquorum.c
 */

int lvm_chkquorum ( );
int lvm_vgsamwcc ( );



/*
 * computl.c
 */

int lvm_chkvaryon ( );
int lvm_mapoff ( );
int lvm_openmap ( );
int lvm_relocmwcc ( );
int lvm_rdiplrec ( );
int lvm_tscomp ( );
int lvm_updtime ( );

/*
 * crtinsutl.c
 */

int lvm_initbbdir ( );
void lvm_initlvmrec ( );
int lvm_instsetup ( );
void lvm_pventry ( );
int lvm_vgdas3to3 ( );
int lvm_vgmem ( );
int lvm_zeromwc ( );
int lvm_zerosa ( );

/*
 * configutl.c
 */

int lvm_addmpv ( );
int lvm_addpv ( );
int lvm_chgvgsa ( );
int lvm_chkvgstat ( );
int lvm_config ( );
int lvm_defvg ( );
int lvm_delpv ( );
void lvm_delvg ( );



/*
 * lvmrecutl.c
 */

void lvm_cmplvmrec ( );
int lvm_rdlvmrec ( );
int lvm_wrlvmrec ( );
void lvm_zerolvm ( );



/*
 * queryutl.c
 */

extern int lvm_chklvclos ( );
extern int lvm_getpvda ( );
extern int lvm_gettsinfo ( );

/*
 * rdex_com.c
 */

extern int rdex_proc ( );
    /* struct lv_id *lv_id          logical volume id
       struct ext_redlv *ext_red    maps of pps to be extended or reduced
       int indicator                indicator for extend or reduce operation */



/*
 * revaryon.c
 */

int lvm_revaryon ( );
void lvm_vonmisspv ( );



/*
 * setupvg.c
 */

int lvm_setupvg ( );
int lvm_bldklvlp ( );
int lvm_mwcinfo ( );



/*
 * synclp.c
 */

extern int synclp ( );
    /* int lvfd                  logical volume file descriptor
       struct lv_entries *lv     pointer to logical volume entry
       char *vgptr               pointer to the volume group mapped file
       int vgfd                  volume group mapped file descriptor
       short minor_num           minor number of the logical volume
       int lpnum                 logical partition number to sync

FIG. 21-46 


       struct unique_id *vg_id   volume group id
       int force                 resync any non-stale lp if TRUE */



/*
 * utilities.c
 */

extern int get_lvinfo ( );
    /* struct unique_id *vg_id   volume group id
       struct lv_id *lv_id       logical volume id
       int *vgfd                 volume group file descriptor
       short *minor_num          logical volume minor number
       char **vgptr              pointer to volume group mapped file
       int mode                  how to open the vg mapped file */

extern int get_ptrs ( );
    /* struct fheader **header     points to the file header
       struct vgheader **vgptr     points to the volume group header
       struct lv_entries **lvptr   points to the logical volume entries
       struct pv_header **pvptr    points to the physical volume header
       struct pp_entries **pp_ptr  points to the physical partition ents
       struct namelist **nameptr   points to the name descriptor area */

extern int lvm_errors ( );
    /* char failing_rtn [LVM_NAMESIZ]   name of routine with error
       char calling_rtn [LVM_NAMESIZ]   name of calling routine
       int rc                           error returned from failing rtn */

extern int get_pvandpp ( );
    /* struct pv_header **pv    pointer to the physical volume header
       struct pp_entries **pp   pointer to the physical partition entry
       short *pvnum             pv number of physical volume id sent
       char *vgfptr             pointer to the volume group mapped file in
       struct unique_id *id     id of pv you need a pointer to */

extern int bldlvinfo ( );
    /* struct logview **lp      pointer to logical view of a logical vol
       char *vgfptr             pointer to volume group mapped file
       struct lv_entries *lv    pointer to a logical volume
       long *cnt                number of pps per copy of logical volume
       short minor_num          minor number of logical volume */

extern int status_chk ( );
    /* char *vgptr      pointer to volume group mapped file
       char *name       name of device to be checked
       int flag         indicator to check the major number
       char *rawname    pointer to new raw device name */

extern int lvm_special_3to2 ( );
    /* char *pvname,             name of physical volume being removed
       struct unique_id *vgid,   pointer to the volume group id
       int lvfd,                 reserved area logical volume file desc
       char *vgdaptr,            pointer to vgda area of the vg file
       short pvnum0,             number of pv to delete/remove
       short pvnum1,             number of pv to keep one copy
       short pvnum2,             number of pv to keep two copies
       char match                indicates the vgid in the lvm record matches
                                 the vgid passed in
       char delete               indicates we are called by deletepv
       struct fheader *fhead     pointer to vg mapped file header */



extern int getstates ( );
    /* struct vgsa_area *vgsa   pointer to buffer for volume group status
                                area
       char *vgfptr             pointer to volume group mapped file */

extern int timestamp ( );
    /* struct vg_header *vg,     pointer to volume group header
       char *vgptr,              pointer to volume group file
       struct fheader *fhead);   pointer to vg file header
     */

extern int timestamp ( );
    /*
       struct vg_header *vg,     pointer to volume group header
       char *vgptr,              pointer to volume group file
       struct fheader *fhead)    pointer to file header of vg file
     */



extern int buildname ( );
    /* dev_t dev,                  device info for physical volume
       char name [LVM_EXTNAME],    array to store name we create for pv
       int mode,                   mode to set the device entry to
       int type);                  type of name to build
     */

extern int rebuild_file ( );
    /*
       struct unique_id *vgid,     pointer to volume group id
       int *vgfd);                 vg file descriptor
     */

extern void calc_lsn ( );
    /*
       struct fheader *fhead,      pointer to volume group file header
       struct rebuild *rebuild);   pointer to info from the rebuilding
                                   of the volume group file
     */



/*
 * varyonvg.c
 */

int lvm_deladdm ( );
int lvm_forceqrm ( );
void lvm_mapfile ( );
int lvm_pvstatus ( );
int lvm_update ( );

/*
 * verify.c
 */

int lvm_verify ( );
void lvm_clsinpvs ( );
int lvm_defpvs ( );
void lvm_getdainfo ( );
int lvm_readpvs ( );
int lvm_readvgda ( );

/*
 * wrtutl.c
 */

int lvm_diskio ( );
void lvm_updvgda ( );
int lvm_wrtdasa ( );
int lvm_wrtmapf ( );
int lvm_wrtnext ( );



#endif /* _NO_PROTO */
#endif /* _H_LIBLVM */



/* DASD.H */

#ifndef _H_DASD
#define _H_DASD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager - dasd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

/*
 * Logical Volume Manager Device Driver data structures.
 */

#include <sys/types.h>
#include <sys/sleep.h>
#include <sys/lockl.h>
#include <sys/sysmacros.h>
#include <sys/buf.h>
#include <sys/lvdd.h>

/* FIFO queue structure for scheduling logical requests. */
struct hd_queue {               /* queue header structure */
    struct buf *head;           /* oldest request in the queue */
    struct buf *tail;           /* newest request in the queue */
};
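
The head/tail pair above is enough to run a FIFO of requests: new requests are linked at the tail and the scheduler takes the oldest from the head. The following is a minimal sketch of that discipline, not the driver's own code. The `struct req` type, its `av_forw` link field, and the helper names `q_put` and `q_get` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct req {                    /* hypothetical stand-in for struct buf */
    struct req *av_forw;        /* link to the next queued request */
    int id;
};

struct req_queue {              /* same shape as hd_queue */
    struct req *head;           /* oldest request in the queue */
    struct req *tail;           /* newest request in the queue */
};

/* Append a request at the tail of the queue. */
static void q_put(struct req_queue *q, struct req *r)
{
    assert(r != NULL);
    r->av_forw = NULL;
    if (q->head == NULL)
        q->head = r;            /* queue was empty */
    else
        q->tail->av_forw = r;
    q->tail = r;
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
static struct req *q_get(struct req_queue *q)
{
    struct req *r = q->head;
    if (r != NULL) {
        q->head = r->av_forw;
        if (q->head == NULL)
            q->tail = NULL;     /* queue drained */
    }
    return r;
}
```

Keeping a tail pointer makes both operations constant time, which matters when requests are queued from interrupt-level code.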

struct hd_capvq {               /* queue header structure */
    struct pv_wait *head;       /* oldest request in the queue */
    struct pv_wait *tail;       /* newest request in the queue */
};

/*
 * Structure used by hd_redquiet( ) to mark target PPs for removal.
 * Both are zero relative.
 */
struct hd_lvred {
    long lp;                    /* LP the pp belongs to */
    char mirror;                /* mirror number of PP */
};
/*
 * Physical request buf structure.
 *
 * A 'pbuf' is a 'buf' structure with some additional fields used
 * to track the status of the physical requests that correspond to
 * each logical request. A pool of pinned pbuf's is allocated and
 * managed by the device driver. The size of this pool depends on
 * the number of open logical volumes.
 */
struct pbuf {
    /* this must come first, 'buf' pointers can be cast to 'pbuf' */
    struct buf pb;              /* imbedded buf for physical driver */

    /* physical buf structure appendage: */
    struct buf *pb_lbuf;        /* corresponding logical buf struct */

    /* scheduler I/O done policy function */
#ifndef _NO_PROTO
    void (*pb_sched) (struct pbuf *);
#else
    void (*pb_sched) ();
#endif

    struct pvol *pb_pvol;       /* physical volume structure */
    struct bad_blk *pb_bad;     /* defects directory entry */
    daddr_t pb_start;           /* starting physical address */

    char pb_mirror;             /* current mirror */
    char pb_miravoid;           /* mirror avoidance mask */
    char pb_mirbad;             /* mask of broken mirrors */
    char pb_mirdone;            /* mask of mirrors done */

    char pb_swretry;            /* number of sw relocation retries */
    char pb_type;               /* Type of pbuf */
    char pb_bbop;               /* BB directory operation */
    char pb_bbstat;             /* status of BB directory operation */

    uchar pb_whl_stop;          /* wheel_idx value when this pbuf is */
                                /* to get off of the wheel */
#ifdef DEBUG
    ushort pb_hw_reloc;         /* Debug - it was a HW reloc request */
    char pad;                   /* pad to full long word */
#else
    char pad [3];               /* pad to full long word */
#endif

    struct part *pb_part;       /* ptr to part structure. Care must */
                                /* be taken when this is used since */
                                /* the parts structure can be moved */
                                /* by hd_config routines while the */
                                /* request is in flight */
    struct unique_id *pb_vgid;  /* volume group ID */

    /* used to dump the allocated pbuf at dump time */
    struct pbuf *pb_forw;       /* forward pointer */
    struct pbuf *pb_back;       /* backward pointer */
};
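
The comment on the first member of the pbuf says that a buf pointer can be cast to a pbuf pointer. This works because C guarantees that a pointer to a structure, suitably converted, points to its first member and back. A minimal sketch of the idiom follows, with both structures cut down to a few fields. The reduced `struct buf` and the helper name `buf_to_pbuf` are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

struct buf {                    /* hypothetical, reduced to one field */
    long b_blkno;
};

struct pbuf {
    struct buf pb;              /* must stay the first member */
    struct buf *pb_lbuf;        /* corresponding logical request */
    char pb_mirror;             /* current mirror */
};

/*
 * A lower-level driver hands back the struct buf it was given.
 * Because pb sits at offset 0, the same address is also the pbuf.
 */
static struct pbuf *buf_to_pbuf(struct buf *bp)
{
    assert(offsetof(struct pbuf, pb) == 0);
    return (struct pbuf *)bp;
}
```

If a member were ever placed ahead of `pb`, the cast would silently yield a wrong address, which is why the listing insists that the embedded buf comes first.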

#define pb_addr pb.b_un.b_addr  /* too ugly in its raw form */

/* defines for pb_swretry */
#define MAX_SWRETRY     3       /* maximum retries for relocation
                                   before declaring disk dead */

/* values for b_work in pbuf struct (since real b_work value only used
 * in lbuf)
 */
#define FIX_READ_ERROR  1       /* fix a previous EMEDIA read error */
#define FIX_ESOFT       2       /* fix a read or write ESOFT error */
#define FIX_EMEDIA      3       /* fix a write EMEDIA error */

/* defines for pb_type */
#define SA_PVMISSING    1       /* PV missing type request */
#define SA_STALEPP      2       /* stale PP type request */
#define SA_FRESHPP      3       /* fresh PP type request */
#define SA_CONFIGOP     4       /* hd_config operation type request */

/*
 * defines to tell hd_bldpbuf what kind of pbuf to build
 * These defines are not the only ones that tell hd_bldpbuf what to
 * build. Check the routine before changing/adding new defines here
 */
#define CATYPE_WRT      1       /* pbuf struct is a cache write type */

/*
 * defines for pb_bbop
 *
 * First set is used by the requests pbuf that is requesting the BB operation.
 * The second set is used in the bb_pbuf to control the action of the
 * actual reading and writing of the BB directory of the PV.
 */
#define BB_ADD          41      /* Add a new bad block entry to BB directory */
#define BB_UPDATE       42      /* Update a bad block entry to BB directory */
#define BB_DELETE       43      /* Delete a bad block entry to BB directory */
#define BB_RDDFCT       44      /* Reading a defective block */
#define BB_WTDFCT       45      /* Writing a defective block */
#define BB_SWRELO       46      /* Software relocation in progress */

#define RD_BBPRIM       70      /* Read the BB primary directory */
#define WT_UBBPRIM      71      /* Write BB prim dir with UPDATE */
#define WT_DBBPRIM      72      /* Rewrite BB prim dir 1st blk with UPDATE */
#define WT_UBBBACK      73      /* Write BB backup dir with UPDATE */
#define WT_DBBBACK      74      /* Rewrite BB back dir 1st blk with UPDATE */

/* defines for pbberror; 0-63 (good) 64-127 (bad) */
#define BB_SUCCESS      0       /* BBdir updating worked */
#define BB_CRB          1       /* Reloc blkno was changed in this BB entry */
#define BB_ERROR        64      /* Bad Block directories were not updated */
#define BB_FULL         65      /* BBdir is full - no free bad blk entries */



/*
 * Volume group structure.
 *
 * Volume groups are implicitly open when any of their logical volumes are.
 */
#define MAXVGS          255     /* implementation limit on # VGs */
#define MAXLVS          256     /* implementation limit on # LVs */
#define MAXPVS          32      /* implementation limit on number */
                                /* physical volumes per vg */
#define CAHHSIZE        8       /* Number of mwc cache queues */
#define NBPI    (NBPB * sizeof(int))    /* Number of bits per int */
#define NBPL    (NBPB * sizeof(long))   /* Number of bits per long */

/* macros to set and clear the bits in the opn_pin array */
#define SETLVOPN(Vg,N)  ((Vg)->opn_pin[(N)/NBPI] |= 1 << ((N)%NBPI))
#define CLRLVOPN(Vg,N)  ((Vg)->opn_pin[(N)/NBPI] &= ~(1 << ((N)%NBPI)))
#define TSTLVOPN(Vg,N)  ((Vg)->opn_pin[(N)/NBPI] & 1 << ((N)%NBPI))

/*
 * macros to set and clear the bits in the ca_pv_wrt field
 * NOTE TSTALLPVWRT will not work if max PVs per VG is greater than 32
 */
#define SETPVWRT(Vg,N)  ((Vg)->ca_pv_wrt[(N) / NBPL] |= 1 << ((N) % MAXPVS))
#define CLRPVWRT(Vg,N)  ((Vg)->ca_pv_wrt[(N) / NBPL] &= ~(1 << ((N) % MAXPVS)))
#define TSTPVWRT(Vg,N)  ((Vg)->ca_pv_wrt[(N) / NBPL] & (1 << ((N) % MAXPVS)))
#define TSTALLPVWRT(Vg) ((Vg)->ca_pv_wrt[(MAXPVS - 1) / NBPL])
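
The open-LV macros treat an array of ints as one long bit string: bit N lives in word N / NBPI at position N % NBPI. A minimal sketch of the same indexing, written as functions so the arithmetic is easy to follow, is given here. The `struct lv_bitmap` wrapper and the function names are assumptions for illustration, and `CHAR_BIT` stands in for the system's `NBPB`.

```c
#include <assert.h>
#include <limits.h>

#define MAXLVS 256                          /* value from the listing */
#define NBPI   (CHAR_BIT * sizeof(int))     /* bits per int */

/* Hypothetical stand-in for the opn_pin member of struct volgrp. */
struct lv_bitmap {
    unsigned int opn_pin[(MAXLVS + (NBPI - 1)) / NBPI];
};

/* Mark logical volume n as open. */
static void set_lv_open(struct lv_bitmap *vg, unsigned n)
{
    assert(n < MAXLVS);
    vg->opn_pin[n / NBPI] |= 1u << (n % NBPI);
}

/* Mark logical volume n as closed. */
static void clr_lv_open(struct lv_bitmap *vg, unsigned n)
{
    assert(n < MAXLVS);
    vg->opn_pin[n / NBPI] &= ~(1u << (n % NBPI));
}

/* Return 1 if logical volume n is open, else 0. */
static int tst_lv_open(const struct lv_bitmap *vg, unsigned n)
{
    assert(n < MAXLVS);
    return (int)((vg->opn_pin[n / NBPI] >> (n % NBPI)) & 1u);
}
```

The array size rounds up with `(MAXLVS + (NBPI - 1)) / NBPI`, the same expression the volgrp structure uses, so a limit that is not a multiple of the word size still gets a full final word.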

/*
 * head of list of varied on volgrp structs in the system
 */
struct
{
    lock_t lock;                /* lock while manipulating list of VG structs */
    struct volgrp *ptr;         /* ptr to list of varied on VG structs */
} hd_vghead = {EVENT_NULL, NULL};



struct volgrp {
    lock_t vg_lock;                     /* lock for all vg structures */
    short pad1;                         /* pad to long word boundary */
    short partshift;                    /* log base 2 of part size in blks */
    short open_count;                   /* count of open logical volumes */
    ushort flags;                       /* VG flags field */
    ulong tot_io_cnt;                   /* number of logical request to VG */
    struct lvol *lvols [MAXLVS];        /* logical volume struct array */
    struct pvol *pvols [MAXPVS];        /* physical volume struct array */
    long major_num;                     /* major number of volume group */
    struct unique_id vg_id;             /* volume group id */
    struct volgrp *nextvg;              /* pointer to next volgrp structure */

    /* Array of bits indicating open LVs */
    /* A bit per LV */
    int opn_pin [(MAXLVS + (NBPI - 1))/NBPI];

    pid_t von_pid;                      /* process ID of the varyon process */

    /* Following used in write consistency cache management */
    struct volgrp *nxtactvg;            /* pointer to next volgrp with */
                                        /* write consistency activity */
    struct pv_wait *ca_freepvw;         /* head of pv_wait free list */
    struct pv_wait *ca_pvwmem;          /* ptr to memory malloced for pvw */
                                        /* free list */
    struct hd_queue ca_hld;             /* head/tail of cache hold queue */
    ulong ca_pv_wrt [(MAXPVS + (NBPL - 1)) / NBPL];
                                        /* when bit set write cache to PV */
    char ca_inflt_cnt;                  /* number of PV active writing cache */
    char ca_size;                       /* number of entries in cache */
    ushort ca_pvwblked;                 /* number of times the pv_wait free */
                                        /* list has been empty */
    struct mwc_rec *mwc_rec;            /* ptr to part 1 of cache - disk rec */
    struct ca_mwc_mp *ca_part2;         /* ptr to part 2 of cache - memory */
    struct ca_mwc_mp *ca_lst;           /* mru/lru cache list anchor */
    struct ca_mwc_mp *ca_hash [CAHHSIZE]; /* write consistency hash anchors */

    /* the following 2 variables are used to control a cache clean up */
    /* operation. */
    pid_t bcachwait;                    /* list waiting at the beginning */
    pid_t ecachwait;                    /* list waiting at the end */
    volatile int wait_cnt;              /* count of cleanup waiters */

    /* the following are used to control the VGSAs and the wheel */
    uchar quorum_cnt;                   /* Number indicating quorum of SAs */
    uchar wheel_idx;                    /* VGSA wheel index into pvols */
    ushort whl_seq_num;                 /* VGSA memory image sequence number */
    struct pbuf *sa_act_lst;            /* head of list of pbufs that are */
                                        /* actively on the VGSA wheel */
    struct pbuf *sa_hld_lst;            /* head of list of pbufs that are */
                                        /* waiting to get on the VGSA wheel */
    struct vgsa_area *vgsa_ptr;         /* ptr to in memory copy of VGSA */
    pid_t config_wait;                  /* PID of process waiting in the */
                                        /* hd_config routines to modify the */
                                        /* memory version of the VGSA */
    struct buf sa_lbuf;                 /* logical buf struct to use to wrt */
                                        /* the VGSAs */
    struct pbuf sa_pbuf;                /* physical buf struct to use to wrt */
                                        /* the VGSAs */
};
/*
 * Defines for flags field in volgrp structure
 */
#define VG_SYSMGMT      0x0002  /* VG is on for system management */
                                /* only commands */
#define VG_FORCEDOFF    0x0004  /* Should only be on when the VG was */
    /* forced varied off and there were LVs still open. Under this con- */
    /* dition the driver entry points can not be deleted from the device */
    /* switch table. Therefore the volgrp structure must be kept */
    /* around to handle any rogue operations on this VG. */
#define VG_OPENING      0x0008  /* VG is being varied on */
#define CA_INFLT        0x0010  /* The cache is being written or */
                                /* locked */
#define CA_VGACT        0x0020  /* This volgrp on mwc active list */
#define CA_HOLD         0x0040  /* Hold the cache in flight */
#define CA_FULL         0x0080  /* Cache is full - no free entries */
#define SA_WHL_ACT      0x0100  /* VGSA wheel is active */
#define SA_WHL_HLD      0x0200  /* VGSA wheel is on hold */
#define SA_WHL_WAIT     0x0400  /* config function is waiting for */
                                /* the wheel to stop */



/*
 * Logical volume structure.
 */
struct lvol {
    struct buf **work_Q;        /* work in progress hash table */
    short lv_status;            /* lv status: closed, closing, open */
    short lv_options;           /* logical dev options (see below) */
    short nparts;               /* num of part structures for this */
                                /* lv - base 1 */
    char i_sched;               /* initial scheduler policy state */
    char pad;                   /* padding so data word aligned */
    ulong nblocks;              /* LV length in blocks */
    struct part *parts [3];     /* partition arrays for each mirror */
    ulong tot_wrts;             /* total number of writes to LV */
    ulong tot_rds;              /* total number of reads to LV */

    /* These fields of the lvol structure are read and/or written by
     * the bottom half of the LVDD; and therefore must be carefully
     * modified.
     */
    int complcnt;               /* completion count - used to quiesce */
    int waitlist;               /* event list for quiesce of LV */
};



/* lv status: */
#define LV_CLOSED       0       /* logical volume is closed */
#define LV_CLOSING      1       /* trying to close the LV */
#define LV_OPEN         2       /* logical volume is open */

/* scheduling policies: */
#define SCH_REGULAR     0       /* regular, non-mirrored LV */
#define SCH_SEQUENTIAL  1       /* sequential write, seq read */
#define SCH_PARALLEL    2       /* parallel write, read closest */
#define SCH_SEQWRTPARRD 3       /* sequential write, read closest */
#define SCH_PARWRTSEQRD 4       /* parallel write, seq read */

/* logical device options: */
#define LV_NOBBREL      0x0010  /* no bad block relocation */
#define LV_RDONLY       0x0020  /* read-only logical volume */
#define LV_DMPINPRG     0x0040  /* Dump in progress to this LV */
#define LV_DMPDEV       0x0080  /* This LV is a DUMP device */
                                /* i.e. DUMPINIT has been done */



FIG. 21-59 






#define LV_NOMWC        0x0100  /* no mirror write consistency */
                                /* checking */
#define LV_WRITEV       WRITEV  /* Write verify writes in LV */

/* work_Q hash algorithm - just a stub now */
#define HDHASH(Lb) \
    (BLK2TRK((Lb)->b_blkno) & (WORKQ_SIZE-1))

/*
 * Partition structure.
 */
struct part {
    struct pvol *pvol;          /* containing physical volume */
    daddr_t start;              /* starting physical disk address */
    short sync_trk;             /* current LTG being resynced */
    char ppstate;               /* physical partition state */
    char sync_msk;              /* current LTG sync mask */
};
/*
 * Physical partition state defines PP_ and structure defines.
 * The PP_STALE and PP_REDUCING bits could be combined into one but it
 * is easier to understand if they are not and a problem arises later.
 * The PP_RIP bit is only valid in the primary part structure.
 */
#define PP_STALE        0x01    /* Set when PP is stale */
#define PP_CHGING       0x02    /* Set when PP is stale but the */
                                /* VGSAs have not been completely */
                                /* updated yet */
#define PP_REDUCING     0x04    /* Set when PP is in the process */
                                /* of being removed (reduced out) */
#define PP_RIP          0x08    /* Set when a Resync is in progress */
                                /* When set "sync_trk" indicates */
                                /* the track being synced. If */
                                /* sync_trk not == -1 and PP_RIP */
                                /* not set sync_trk is next trk */
                                /* to be synced */
#define PP_SYNCERR      0x10    /* Set when error in a partition */
                                /* being resynced. Causes the */
                                /* partition to remain stale. */

#define NO_SYNCTRK      -1      /* The LP does not have a resync */
                                /* in progress */
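
The PP_RIP comment describes three resync situations that are told apart by the PP_RIP bit and the value of sync_trk. A minimal sketch of that decision, under the assumption that the two fields are read together from the primary part structure, could look like this. The enum and the helper name `get_sync_phase` are hypothetical and are not part of the listing.

```c
#include <assert.h>

#define PP_STALE    0x01
#define PP_RIP      0x08
#define NO_SYNCTRK  (-1)

enum sync_phase {
    SYNC_NONE,      /* no resync in progress for this LP */
    SYNC_ACTIVE,    /* sync_trk is the track being synced right now */
    SYNC_NEXT       /* sync_trk is the next track to be synced */
};

/* Classify the resync state from ppstate and sync_trk. */
static enum sync_phase get_sync_phase(char ppstate, short sync_trk)
{
    if (ppstate & PP_RIP)
        return SYNC_ACTIVE;
    if (sync_trk != NO_SYNCTRK)
        return SYNC_NEXT;
    return SYNC_NONE;
}
```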

/*
 * Physical volume structure.
 *
 * Contains defects directory hash anchor table. The defects
 * directory is hashed by track group within partition. Entries within
 * each congruence class are sorted in ascending block addresses.
 *
 * This scheme doesn't quite work, yet. The congruence classes need
 * to be aligned with logical track groups or partitions to guarantee
 * that all blocks of this request are checked. But physical addresses
 * need not be aligned on track group boundaries.
 */
#define HASHSIZE        64      /* number of defect hash classes */

struct defect_tbl {
    struct bad_blk *defects [HASHSIZE]; /* defect directory anchor */
};

struct pvol {
    dev_t dev;                  /* dev_t of physical device */
    daddr_t armpos;             /* last requested arm position */
    short xfcnt;                /* transfer count for this pv */
    short pvstate;              /* PV state */
    short pvnum;                /* LVM PV number 0-31 */
    short vg_num;               /* VG major number */
    struct file *fp;            /* file pointer from open of PV */
    char flags;                 /* place to hold flags */
    char pad;                   /* unused */
    short num_bbdir_ent;        /* current number of BB Dir entries */
    daddr_t fst_usr_blk;        /* first available block on the PV */
                                /* for user data */
    daddr_t beg_relblk;         /* first blkno in reloc pool */
    daddr_t next_relblk;        /* blkno of next unused relocation */
                                /* block in reloc blk pool at end */
                                /* of PV */
    daddr_t max_relblk;         /* largest blkno avail for reloc */
    struct defect_tbl *defect_tbl; /* pointer to defect table */
    struct hd_capvq ca_pv;      /* head/tail of queue of request */
                                /* waiting for cache write to */
                                /* complete */
    struct sa_pv_whl {          /* VGSA information for this PV */
        daddr_t lsn;            /* SA logical sector number - LV 0 */
        ushort sa_seq_num;      /* SA wheel sequence number */
        char nukesa;            /* flag set if SA to be deleted */
        char pad;               /* pad to full long word */
    } sa_area [2];              /* one for each possible SA on PV */

    struct pbuf pv_pbuf;        /* pbuf struct for writing cache */
};

/* defines for pvstate field */
#define PV_MISSING      1       /* PV cannot be accessed */
#define PV_RORELOC      2       /* No HW or SW relocation allowed */
                                /* only known bad blocks relocated */

/*
 * returns index into the bad block hash table for this block number
 */
#define BBHASH_IND(blkno)       (BLK2TRK(blkno) & (HASHSIZE - 1))

/*
 * Macro to return defect directory congruence class pointer
 */
#define HASH_BAD(Pb,Bad_blkno) \
    ((Pb)->pb_pvol->defect_tbl->defects[BLK2TRK(Bad_blkno)&(HASHSIZE-1)])

/*
 * Used by the LVM dump device routines same as HASH_BAD but the first
 * argument is a pvol struct pointer
 */
#define HASH_BAD_DMP(Pvol,Blkno) \
    ((Pvol)->defect_tbl->defects[BLK2TRK(Blkno)&(HASHSIZE-1)])
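
The hash macros pick a congruence class from the track number of a block, and the structure comment says that entries in each class are kept in ascending block order. A minimal sketch of how such a chain could be maintained and searched follows. It is an illustration only: the reduced `struct bad_blk`, the helpers `bb_insert` and `bb_find`, and the assumption that a request stays inside one logical track group are not taken from the listing, and the shift of 8 corresponds to TRKSHIFT + BPPGSHIFT with the values defined later in the header.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSIZE 64
#define BLK2TRK(Blk) ((unsigned)(Blk) >> 8)   /* 256 blocks per track group */

struct bad_blk {                /* reduced to the fields the search needs */
    struct bad_blk *next;       /* next entry in congruence class */
    long blkno;                 /* bad physical disk address */
};

struct defect_tbl {
    struct bad_blk *defects[HASHSIZE];
};

/* Link bb into its class, keeping the chain in ascending blkno order. */
static void bb_insert(struct defect_tbl *t, struct bad_blk *bb)
{
    struct bad_blk **pp = &t->defects[BLK2TRK(bb->blkno) & (HASHSIZE - 1)];
    while (*pp != NULL && (*pp)->blkno < bb->blkno)
        pp = &(*pp)->next;
    bb->next = *pp;
    *pp = bb;
}

/* First known bad block in [blkno, blkno + nblks), or NULL if none. */
static struct bad_blk *bb_find(struct defect_tbl *t, long blkno, long nblks)
{
    struct bad_blk *bb = t->defects[BLK2TRK(blkno) & (HASHSIZE - 1)];
    assert(nblks > 0);
    for (; bb != NULL && bb->blkno < blkno + nblks; bb = bb->next)
        if (bb->blkno >= blkno)
            return bb;
    return NULL;
}
```

Because the chain is sorted, the search can stop at the first entry past the end of the request instead of walking the whole class.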



/*
 * Bad block directory entry.
 */
struct bad_blk {                /* bad block directory entry */
    struct bad_blk *next;       /* next entry in congruence class */
    dev_t dev;                  /* containing physical device */
    daddr_t blkno;              /* bad physical disk address */
    unsigned status: 4;         /* relocation status (see below) */
    unsigned reblk: 28;         /* relocated physical disk address */
};

/* bad block relocation status values: */
#define REL_DONE        0       /* software relocation completed */
#define REL_PENDING     1       /* software relocation in progress */
#define REL_DEVICE      2       /* device (HW) relocation requested */
#define REL_CHAINED     3       /* relocation blk structure exists */
#define REL_DESIRED     8       /* relocation desired - hi order bit on */

/*
 * Macros for getting and releasing bad block structures from the
 * pool of bad_blk structures. They are linked together by their next pointers.
 * "hd_freebad" points to the head of bad_blk free list
 *
 * NOTE: Code must check if hd_freebad != null before calling
 * the GET_BBLK macro.
 */
#define GET_BBLK(Bad) { \
    (Bad) = hd_freebad; \
    hd_freebad = hd_freebad->next; \
    hd_freebad_cnt--; \
}

#define REL_BBLK(Bad) { \
    (Bad)->next = hd_freebad; \
    hd_freebad = (Bad); \
    hd_freebad_cnt++; \
}
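
The two macros implement a singly linked free list used as a stack: releasing pushes a structure on the head and getting pops it off. A minimal sketch of the same logic as functions is shown here, with the empty-list check that the NOTE leaves to callers folded into the getter. The node type and the names `freebad`, `freebad_cnt`, `rel_bblk` and `get_bblk` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct bad_blk_node {           /* hypothetical, reduced to the link field */
    struct bad_blk_node *next;
};

static struct bad_blk_node *freebad;    /* head of the free list */
static int freebad_cnt;                 /* number of entries on the list */

/* Return a structure to the pool (push on the head). */
static void rel_bblk(struct bad_blk_node *bad)
{
    assert(bad != NULL);
    bad->next = freebad;
    freebad = bad;
    freebad_cnt++;
}

/* Take a structure from the pool (pop the head), or NULL if it is empty. */
static struct bad_blk_node *get_bblk(void)
{
    struct bad_blk_node *bad = freebad;
    if (bad == NULL)
        return NULL;            /* the check the NOTE asks callers to make */
    freebad = bad->next;
    freebad_cnt--;
    return bad;
}
```

The macro form skips the NULL test for speed, so a caller that forgets the check would dereference a null head pointer.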

/*
 * Macros for accessing these data structures.
 */
#define VG_DEV2LV(Vg, Dev)      ((Vg)->lvols[minor(Dev)])
#define VG_DEV2PV(Vg, Pnum)     ((Vg)->pvols[(Pnum)])

#define BLK2PART(Pshift,Lbn)    ((ulong)(Lbn) >> (Pshift))
#define PART2BLK(Pshift,P_no)   ((P_no) << (Pshift))
#define PARTITION(Lv,P_no,Mir)  ((Lv)->parts[(Mir)] + (P_no))

/*
 * Mirror bit definitions
 */
#define PRIMARY_MIRROR          001     /* primary mirror mask */
#define SECONDARY_MIRROR        002     /* secondary mirror mask */
#define TERTIARY_MIRROR         004     /* tertiary mirror mask */
#define ALL_MIRRORS             007     /* mask of all mirror bits */

/* macro to extract mirror avoidance mask from ext parameter */
#define X_AVOID(Ext)    ( ((Ext) >> AVOID_SHFT) & ALL_MIRRORS )

/*
 * Macros to select mirrors using avoidance masks:
 *
 * FIRST_MIRROR returns first unmasked mirror (0 to 2); 3 if all masked
 * FIRST_MASK   returns first masked mirror (0 to 2); 3 if none masked
 * MIRROR_COUNT returns number of unmasked mirrors (0 to 3)
 * MIRROR_MASK  returns a mask to avoid a specific mirror (1, 2, 4)
 * MIRROR_EXIST returns a mask for non-existent mirrors (0, 4, 6, or 7)
 */
#define FIRST_MIRROR(Mask)      ((0x30102010 >> ((Mask) << 2)) & 0x0f)
#define FIRST_MASK(Mask)        ((0x01020103 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)      ((0x01121223 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_EXIST(Nmirrors)  ((0x00000467 >> ((Nmirrors) << 2)) & 0x0f)
#define MIRROR_MASK(Mirror)     (1 << (Mirror))
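
Each of the mirror-selection macros packs an eight-entry lookup table into one 32-bit constant, one hex digit per entry, and selects the entry by shifting right four bits for every step of the 3-bit mask. A minimal sketch that checks the FIRST_MIRROR table against a straightforward loop is given below. The loop `first_mirror_loop` and the checker `tables_agree` are hypothetical helpers written for this comparison.

```c
#include <assert.h>

/* Constants as given in the listing: hex digit i is the answer for mask i. */
#define FIRST_MIRROR(Mask)  ((0x30102010 >> ((Mask) << 2)) & 0x0f)
#define FIRST_MASK(Mask)    ((0x01020103 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)  ((0x01121223 >> ((Mask) << 2)) & 0x0f)

/* Reference version: lowest clear bit of a 3-bit mask, or 3 if none. */
static int first_mirror_loop(int mask)
{
    int m;
    assert(mask >= 0 && mask < 8);
    for (m = 0; m < 3; m++)
        if (!(mask & (1 << m)))
            return m;
    return 3;
}

/* Returns 1 if the packed table agrees with the loop for all eight masks. */
static int tables_agree(void)
{
    int mask;
    for (mask = 0; mask < 8; mask++)
        if (FIRST_MIRROR(mask) != first_mirror_loop(mask))
            return 0;
    return 1;
}
```

For example, a mask of 5 (mirrors 0 and 2 to be avoided) selects digit 5 of 0x30102010, which is 1, the only mirror left. The table form costs one shift and one AND, with no branches.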

/*
 * DBSIZE and DBSHIFT were originally UBSIZE and UBSHIFT from param.h.
 * They were renamed and moved to here to more closely resemble a disk
 * block and not a user block size.
 */
#define DBSIZE   512   /* Disk block size in bytes */
#define DBSHIFT  9     /* log 2 of DBSIZE */

/*
 * LVPAGESIZE and LVPGSHIFT were originally PAGESIZE and PGSHIFT from param.h.
 * They were renamed and moved to here to isolate LVM from the changeable
 * system parameters that would have undesirable effects on LVM functionality.
 */
#define LVPAGESIZE  4096                     /* Page size in bytes */
#define LVPGSHIFT   12                       /* log 2 of LVPAGESIZE */

#define BPPG        (LVPAGESIZE/DBSIZE)      /* blocks per page */
#define BPPGSHIFT   (LVPGSHIFT-DBSHIFT)      /* log 2 of BPPG */
#define PGPTRK      32                       /* pages per logical track group */
#define TRKSHIFT    5                        /* log base 2 of PGPTRK */
#define LTGSHIFT    (TRKSHIFT + BPPGSHIFT)   /* logical track group log base 2 */
#define BYTEPTRK    PGPTRK*LVPAGESIZE        /* bytes per logical track group */
#define BLKPTRK     PGPTRK*BPPG              /* blocks per logical track group */

#define SIGNED_SHIFTMSK 0x80000000           /* signed mask for shifting to */
                                             /* get page affected mask */

#define BLK2BYTE(Nblocks)  ((unsigned)(Nblocks) << (DBSHIFT))
#define BYTE2BLK(Nbytes)   ((unsigned)(Nbytes) >> (DBSHIFT))
#define BLK2PG(Blk)        ((unsigned)(Blk) >> BPPGSHIFT)
#define PG2BLK(Pageno)     ((Pageno) << (LVPGSHIFT-DBSHIFT))
#define BLK2TRK(Blk)       ((unsigned)(Blk) >> (TRKSHIFT + BPPGSHIFT))
#define TRK2BLK(T_no)      ((unsigned)(T_no) << (TRKSHIFT + BPPGSHIFT))
#define PG2TRK(Pageno)     ((unsigned)(Pageno) >> (TRKSHIFT))

/* LTG per partition */
#define TRKPPART(Pshift)   ((unsigned)(1 << (Pshift - LTGSHIFT)))

/* LTG in the partition */
#define TRK_IN_PART(Pshift, Blk)  (BLK2TRK(Blk) & (TRKPPART(Pshift) - 1))
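The block, page, and logical track group (LTG) conversions reduce to shifts because every unit is a power of two: 512-byte blocks, 8 blocks per page, 32 pages (256 blocks) per LTG. The sketch below works one block number through those shifts. It restates the macros so that it stands alone; `blk_in_trk` is a hypothetical helper, and the partition shift of 13 (8192 blocks, a 4 MB partition) is an assumed value chosen for illustration.

```c
#include <assert.h>

#define DBSHIFT    9                        /* log2 of the 512-byte block */
#define LVPGSHIFT  12                       /* log2 of the 4096-byte page */
#define BPPGSHIFT  (LVPGSHIFT - DBSHIFT)    /* log2 of 8 blocks per page */
#define TRKSHIFT   5                        /* log2 of 32 pages per LTG */
#define LTGSHIFT   (TRKSHIFT + BPPGSHIFT)   /* log2 of 256 blocks per LTG */

#define BLK2PG(Blk)            ((unsigned)(Blk) >> BPPGSHIFT)
#define BLK2TRK(Blk)           ((unsigned)(Blk) >> (TRKSHIFT + BPPGSHIFT))
#define TRK2BLK(T_no)          ((unsigned)(T_no) << (TRKSHIFT + BPPGSHIFT))
#define BLK2PART(Pshift, Lbn)  ((unsigned long)(Lbn) >> (Pshift))
#define TRKPPART(Pshift)       ((unsigned)(1 << ((Pshift) - LTGSHIFT)))
#define TRK_IN_PART(Pshift, Blk)  (BLK2TRK(Blk) & (TRKPPART(Pshift) - 1))

/* Hypothetical helper: offset of a block inside its logical track group. */
static unsigned blk_in_trk(unsigned blk)
{
    unsigned off = blk - TRK2BLK(BLK2TRK(blk));

    assert(off < (1u << LTGSHIFT));     /* always less than 256 blocks */
    return off;
}
```

With the assumed partition shift of 13, logical block 10000 lies in page 1250, in LTG 39 (16 blocks into it), in logical partition 1, and LTG 39 is the eighth of the 32 LTGs in that partition (index 7).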



/* defines for top half of LVDD */
#define LVDD_HFREE_BB  30   /* high water mark for kernel bad_blk struct */
#define LVDD_LFREE_BB  15   /* low water mark for kernel bad_blk struct */
#define WORKQ_SIZE     64   /* size of LVs work in progress queue */
#define PBSUBPOOLSIZE  16   /* size of pbuf subpool alloc'd by PVs */
#define HD_ALIGN            /* align characteristics for alloc'd memory */
#define FULL_WORDMASK  3    /* mask for full word (log base 2) */
#define BUFCNT         3    /* parameter sent to uphysio for # buf */
                            /* structs to allocate */



#define NOMIRROR     0   /* no mirrors */
#define PRIMMIRROR   0   /* primary mirror */
#define SINGMIRROR   1   /* one mirror */
#define DOUBMIRROR   2   /* two mirrors */

#define MAXNUMPARTS  3   /* maximum number of parts in a logical part */
#define PVNUMVGDAS   2   /* max number of VGDA/VGSAs on a PV */



/* return codes for LVDD top 1/2 */
#define LVDD_SUCCESS  0      /* general success code */
#define LVDD_ERROR    -1     /* general error code */
#define LVDD_NOALLOC  -200   /* hd_init: not able to allocate pool of bufs */

#endif /* _H_DASD */

/* HD.H */

#ifndef _H_HD
#define _H_HD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - hd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

#include <sys/errids.h>



/*
 * LVDD internal macros and extern statically declared variables.
 */

/* LVM internal defines: */
#define FAILURE    0    /* must be logic FALSE for 'if' tests */
#define SUCCESS    1    /* must be logic TRUE for 'if' tests */
#define MAXGRABLV  16   /* Max number of LVs to grab pbuf structs */
#define MAXSYSVG   3    /* Max number of VGs to grab pbuf structs */
#define CAHEAD     1    /* move cache entry to head of use list */
#define CATAIL     2    /* move cache entry to tail of use list */
#define CA_MISS    0    /* MWC cache miss */
#define CA_HIT     1    /* MWC cache hit */
#define CA_LBHOLD  2    /* The logical request should hold */



/*
 * Following defines are used to communicate with the kernel process
 */
#define LVDD_KP_TERM    0x80000000   /* Terminate the kernel process */
#define LVDD_KP_BADBLK  0x40000000   /* Need more bad_blk structs */
#define LVDD_KP_ACTMSK  0xC0000000   /* Mask of all events */

/*
 * Following defines are used in the b_options of the logical buf struct.
 * They should be reserved in lvdd.h in relationship to the ext parameters
 */
#define REQ_IN_CACH  0x40000000   /* When set in the lbuf b_options */
                                  /* the request is in the mirror */
                                  /* write consistency cache */
#define REQ_VGSA     0x20000000   /* When set in the lbuf b_options */
                                  /* it means this is a VGSA write */
                                  /* and to use the special sa_pbuf */
                                  /* in the volgrp structure */

/************************************************************************
 * The following variables are only used in the kernel and therefore are
 * only included if the _KERNEL variable is defined.
 ************************************************************************/
#ifdef _KERNEL
#include <sys/syspest.h>

/*
 * Set up a debug level if debug turned on
 */
#ifdef DEBUG
#ifdef LVDD_PHYS
BUGVDEF(debuglvl, 0)
#else
BUGXDEF(debuglvl)
#endif
#endif

/*
 * pending queue
 *
 * This is the primary data structure for passing work from
 * the strategy routines (see hd_strat.c) to the scheduler
 * (see hd_sched.c) via the mirror write consistency logic.
 * From this queue the request will go to one of three other
 * queues.
 *
 * 1. cache hold queue - If the request involves mirrors
 *    and the write consistency cache is in flight,
 *    i.e. being written to PVs.
 *
 * 2. cache PV queue - If the request must wait for the
 *    write consistency cache to be written to the PV.
 *
 * 3. schedule queue - Requests are scheduled from this
 *    queue.
 *
 * This queue is only changed within a device driver critical section.
 */
#ifdef LVDD_PHYS
struct hd_queue pending_Q;
#else
extern struct hd_queue pending_Q;
#endif

/*
 * ready queue - physical requests that are ready to start
 *
 * This queue is only valid within a single critical section.
 * It really contains a list of pbufs, but only the imbedded
 * buf struct is of interest at this point. Since the pointers
 * are of type (struct buf *) it is convenient that the queue be
 * declared similarly.
 */
#ifdef LVDD_PHYS
struct buf *ready_Q = NULL;
#else
extern struct buf *ready_Q;
#endif

/*
 * Chain of free and available pbuf structs.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_freebuf = NULL;
#else
extern struct pbuf *hd_freebuf;
#endif

/*
 * Chain of pbuf structs currently allocated and pinned for LVDD use.
 * Only used at dump time and by crash to find them.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_dmpbuf = NULL;
#else
extern struct pbuf *hd_dmpbuf;
#endif

/*
 * Chain and count of free and available bad_blk structs.
 * The first open of a VG, really the first open of an LV, will cause
 * LVDD_HFREE_BB (currently 30) bad_blk structs to be allocated and
 * chained here. After that, when the count gets to LVDD_LFREE_BB (low
 * water mark, currently 15) the kernel process will be kicked to go
 * get more, up to LVDD_HFREE_BB (high water mark).
 *
 * *NOTE* hd_freebad_lk is a lock mechanism to keep the top half of the
 *        driver and the kernel process from colliding. This would only
 *        happen if the last request before the last LV closed received
 *        an ESOFT or EMEDIA (and the request was a write) and the getting
 *        of a bad_blk struct caused the count to go below the low water
 *        mark. This would result in the kproc trying to put more
 *        structures on the list while hd_close via hd_frefrebb would
 *        be removing them.
 */
#ifdef LVDD_PHYS
int hd_freebad_lk = LOCK_AVAIL;
struct bad_blk *hd_freebad = NULL;
int hd_freebad_cnt = 0;
#else
int hd_freebad_lk;
extern struct bad_blk *hd_freebad;
extern int hd_freebad_cnt;
#endif

/*
 * Chain of volgrp structs that have write consistency caches that need
 * to be written to PVs. This chain is used so all incoming requests
 * can be scanned before putting the write consistency cache in flight.
 * Once in flight the cache is locked out and any new requests will have
 * to wait for all cache writes to finish.
 */
#ifdef LVDD_PHYS
struct volgrp *hd_vg_mwc = NULL;
#else
extern struct volgrp *hd_vg_mwc;
#endif

/*
 * The following arrays are used to allocate mirror write consistency
 * caches in a group of 8 per page. This is due to the way the hide
 * mechanism works only on page quantities. These two arrays should be
 * treated as being in lock step. The lock, hd_ca_lock, is used to
 * ensure only one process is playing with the arrays at any one time.
 */
#define VGS_CA  ((MAXVGS + (NBPB - 1)) / NBPB)
#ifdef LVDD_PHYS
lock_t hd_ca_lock = LOCK_AVAIL;       /* lock for cache arrays */
char ca_alloced[VGS_CA];              /* bit per VG with cache allocated */
struct mwc_rec *ca_grp_ptr[VGS_CA];   /* 1 for each 8 VGs */
#else
extern lock_t hd_ca_lock;
extern char ca_alloced[];
extern struct mwc_rec *ca_grp_ptr[];
#endif
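The two cache arrays run in lock step because eight 512-byte cache records fill one 4096-byte page: one byte of `ca_alloced` holds the allocation bits for the same eight VGs that share one `ca_grp_ptr` page. The sketch below shows that indexing under stated assumptions. The value of MAXVGS, the bit order within a byte, and the helpers `ca_set`, `ca_test`, and `ca_group` are all hypothetical, and the locking with hd_ca_lock is omitted.

```c
#include <assert.h>

#define NBPB    8      /* number of bits per byte */
#define MAXVGS  255    /* assumed value, for illustration only */
#define VGS_CA  ((MAXVGS + (NBPB - 1)) / NBPB)   /* bytes needed, rounded up */

static char ca_alloced[VGS_CA];    /* bit per VG with cache allocated */

/* Hypothetical helper: index of the page (group of 8 VGs) for this VG. */
static int ca_group(int vg)
{
    assert(vg >= 0 && vg < MAXVGS);
    return vg / NBPB;
}

/* Hypothetical helper: mark the cache for this VG as allocated. */
static void ca_set(int vg)
{
    ca_alloced[ca_group(vg)] |= (char)(1 << (vg % NBPB));
}

/* Hypothetical helper: non-zero if the cache for this VG is allocated. */
static int ca_test(int vg)
{
    return ((unsigned char)ca_alloced[ca_group(vg)] >> (vg % NBPB)) & 1;
}
```

With the assumed MAXVGS of 255 the arrays need 32 entries, and VG 9 maps to bit 1 of entry 1.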

/*
 * The following variables are used to control the number of pbuf
 * structures allocated for LVM use. It is based on the number of
 * PVs in varied on VGs. The first PV gets 64 structures and each
 * PV thereafter gets 16 more. The number is reduced only when a
 * VG goes inactive, i.e. all its LVs are closed.
 */
#ifdef LVDD_PHYS
int hd_pbuf_cnt = 0;                /* Total Number of pbufs allocated */
int hd_pbuf_grab = PBSUBPOOLSIZE;   /* Number of pbuf structs to allocate */
                                    /* for each active PV on the system */
int hd_pbuf_min = PBSUBPOOLSIZE * 4;
                                    /* Number of pbuf to allocate for the */
                                    /* first PV on the system */
int hd_vgs_opn = 0;                 /* Number of VGs opened */
int hd_lvs_opn = 0;                 /* Number of LVs opened */
int hd_pvs_opn = 0;                 /* Number of PVs in varied on VGs */
int hd_pbuf_inuse = 0;              /* Number of pbufs currently in use */
int hd_pbuf_maxuse = 0;             /* Maximum number of pbufs in use during */
                                    /* this boot */
#else
extern int hd_pbuf_cnt;
extern int hd_pbuf_grab;
extern int hd_pbuf_min;
extern int hd_vgs_opn;
extern int hd_lvs_opn;
extern int hd_pvs_opn;
extern int hd_pbuf_inuse;
extern int hd_pbuf_maxuse;
#endif
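The sizing rule in the comment (64 pbufs for the first PV, 16 for each PV after that) follows from the two initial values hd_pbuf_min and hd_pbuf_grab. A minimal sketch of that arithmetic is given below; `pbufs_for_pvs` is a hypothetical helper, and PBSUBPOOLSIZE is taken to be 16, which is what the "16 more" in the comment implies.

```c
#include <assert.h>

#define PBSUBPOOLSIZE 16    /* assumed from the "16 more" in the comment */

/* Hypothetical helper: total pbufs allocated once n_pvs physical volumes
 * belong to varied-on volume groups. */
static int pbufs_for_pvs(int n_pvs)
{
    int hd_pbuf_grab = PBSUBPOOLSIZE;       /* each PV after the first */
    int hd_pbuf_min = PBSUBPOOLSIZE * 4;    /* the first PV on the system */

    assert(n_pvs >= 0);
    if (n_pvs == 0)
        return 0;                           /* nothing varied on yet */
    return hd_pbuf_min + (n_pvs - 1) * hd_pbuf_grab;
}
```

Under these assumptions a system with three PVs in varied-on VGs would hold 64 + 2 * 16 = 96 pbuf structures.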



/*
 * The following are used to update the bad block directory on a disk
 */
#ifdef LVDD_PHYS
struct pbuf *bb_pbuf;      /* ptr to pbuf reserved for BB dir updating */
struct hd_queue bb_hld;    /* holding Q used when there is a BB */
                           /* directory update in progress */
#else
extern struct pbuf *bb_pbuf;
extern struct hd_queue bb_hld;
#endif

/*
 * The following variables are used to communicate between the LVDD
 * and the kernel process.
 */
#ifdef LVDD_PHYS
pid_t hd_kpid = 0;    /* PID of the kernel process */
#else
extern pid_t hd_kpid;
#endif



/*
 * The following variables are used in an attempt to keep some information
 * around about the performance and potential bottlenecks in the driver.
 * Currently these must be looked at with crash or the kernel debugger.
 */
#ifdef LVDD_PHYS
ulong hd_pendqblked = 0;   /* How many times the scheduling queue */
                           /* (pending_Q) has been blocked due to no */
                           /* pbufs being available. */
#else
extern ulong hd_pendqblked;
#endif

/*
 * The following are used to log error messages by LVDD. The de_data
 * is defined as a general 16 byte array, BUT, its actual use is
 * totally dependent on the error type.
 */
#define RESRC_NAME "LVDD"     /* Resource name for error logging */
struct hd_errlog_ent {        /* Error log entry structure */
    struct err_rec0 id;
    char de_data[16];
};

/* macros to allocate and free pbuf structures */
#define GET_PBUF(PB) { \
    (PB) = hd_freebuf; \
    hd_freebuf = (struct pbuf *) hd_freebuf->pb.av_forw; \
    hd_pbuf_inuse++; \
    if (hd_pbuf_inuse > hd_pbuf_maxuse) \
        hd_pbuf_maxuse = hd_pbuf_inuse; \
}

#define REL_PBUF(PB) { \
    (PB)->pb.av_forw = (struct buf *)hd_freebuf; \
    hd_freebuf = (PB); \
    hd_pbuf_inuse--; \
}

/* macros to allocate and free pvwait structures */
#define GET_PVWAIT(Pvw, Vg) { \
    (Pvw) = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw)->nxt_pv_wait; \
}

#define REL_PVWAIT(Pvw, Vg) { \
    (Pvw)->nxt_pv_wait = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw); \
}

#define TST_PVWAIT(Vg)  ((Vg)->ca_freepvw == NULL)

/*
 * Macro to put volgrp ptr at head of the list of VGs waiting to start
 * MWC cache writes
 */
#define CA_VG_WRT(Vg) { \
    if (!((Vg)->flags & CA_VGACT)) { \
        (Vg)->nxtactvg = hd_vg_mwc; \
        hd_vg_mwc = (Vg); \
        (Vg)->flags |= CA_VGACT; \
    } \
}



/*
 * Macro to determine if a physical request should be returned to
 * the scheduling layer or continue (resume).
 */
#define PB_CONT(Pb) { \
    if (((Pb)->pb_addr == ((Pb)->pb_lbuf->b_baddr + (Pb)->pb_lbuf->b_bcount)) || \
        ((Pb)->pb.b_flags & B_ERROR)) \
        HD_SCHED((Pb)); \
    else \
        hd_resume((Pb)); \
}

/*
 * HD_SCHED - invoke scheduler policy routine for this request.
 * For physical requests it invokes the physical operation end policy.
 */
#define HD_SCHED(Pb)  (*(Pb)->pb_sched)(Pb)



/* define for b_error value (only used by LVDD) */
#define ELBBLOCKED  255   /* this logical request is blocked by */
                          /* another one in progress */

#endif /* _KERNEL */

/*
 * Write consistency cache structures and macros
 */

/* cache hash algorithms - returns index into cache hash table */
#define CA_HASH(Lb)    (BLK2TRK((Lb)->b_blkno) & (CAHHSIZE-1))
#define CA_THASH(Trk)  ((Trk) & (CAHHSIZE-1))

/*
 * This structure will generally be referred to as part 2 of the cache
 */
struct ca_mwc_mp {    /* cache mirror write consistency memory only part */
    struct ca_mwc_mp *hq_next;   /* ptr to next hash queue entry */
    char state;                  /* State of entry */
    char pad1;                   /* Pad to word */
    ushort iocnt;                /* Non-zero - io active to LTG */
    struct ca_mwc_dp *part1;     /* Ptr to part 1 entry - ca_mwc_dp */
    struct ca_mwc_mp *next;      /* Next memory part struct */
    struct ca_mwc_mp *prev;      /* Previous memory part struct */
};

/* ca_mwc_mp state defines */
#define CANOCHG  0x00   /* Cache entry has NOT changed since last */
                        /* cache write operation, but is on a hash */
                        /* queue somewhere */
#define CACHG    0x01   /* Cache entry has changed since last cache */
                        /* write operation */
#define CACLEAN  0x02   /* Cache entry has not been used since last */
                        /* clean up operation */

/*
 * This structure will generally be referred to as part 1 of the cache
 * In order to stay long word aligned this structure has a 2 byte pad.
 * This reduces the number of cache entries available in the cache.
 */
struct ca_mwc_dp {    /* cache mirror write consistency disk part */
    ulong lv_ltg;       /* LV logical track group */
    ushort lv_minor;    /* LV minor number */
    short pad;
};

#define MAX_CA_ENT  62   /* Max number that will fit in block */

/*
 * This structure must be maintained to be 1 block in length (512 bytes).
 * This also implies the maximum number of write consistency cache entries.
 */
struct mwc_rec {    /* mirror write consistency disk record */
    struct timestruc_t b_tmstamp;         /* Time stamp at beginning of block */
    struct ca_mwc_dp ca_p1[MAX_CA_ENT];   /* Reserve 62 part 1 structures */
    struct timestruc_t e_tmstamp;         /* Time stamp at end of block */
};
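The limit of 62 entries follows from the one-block constraint: a 512-byte record minus two time stamps leaves 496 bytes, and each padded part 1 entry takes 8. The sketch below checks that arithmetic with stand-in types. It assumes a time stamp of two 32-bit fields (8 bytes) and a 32-bit `ulong`, as on a 32-bit system; `struct timestruc` and `max_entries` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

#define DBSIZE      512   /* disk block size in bytes */
#define MAX_CA_ENT  62    /* entry count given in the listing */

/* Stand-in for the time stamp type: assumed to be two 32-bit fields. */
struct timestruc {
    uint32_t tv_sec;
    uint32_t tv_nsec;
};

/* Part 1 entry with fixed-width stand-ins for ulong/ushort/short. */
struct ca_mwc_dp {
    uint32_t lv_ltg;      /* LV logical track group */
    uint16_t lv_minor;    /* LV minor number */
    int16_t pad;          /* keeps the entry long word aligned */
};

struct mwc_rec {
    struct timestruc b_tmstamp;           /* stamp at beginning of block */
    struct ca_mwc_dp ca_p1[MAX_CA_ENT];   /* 62 part 1 entries */
    struct timestruc e_tmstamp;           /* stamp at end of block */
};

/* Hypothetical helper: entries that fit between the two time stamps. */
static unsigned max_entries(unsigned block, unsigned stamp, unsigned entry)
{
    assert(entry > 0 && block >= 2 * stamp);
    return (block - 2 * stamp) / entry;
}
```

The same helper shows the cost of the pad mentioned in the comment: with 6-byte unpadded entries, 82 would fit in the block instead of 62.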

/*
 * This structure is used by the MWCM. It is hung on the PV cache write
 * queues to indicate which lbufs are waiting on any particular PV. The
 * define controls how much memory to allocate to hold these structures.
 * The algorithm is 3 * CA_MULT * cache size * size of structure.
 */
#define CA_MULT  4    /* pv_wait * cache size multiplier */

struct pv_wait {
    struct pv_wait *nxt_pv_wait;   /* next pv_wait structure on chain */
    struct buf *lb_wait;           /* ptr to lbuf waiting for cache */
};
/*
 * LVM function declarations - arranged by module in order by how they occur
 * in said module.
 */
#ifdef _KERNEL
#ifndef _NO_PROTO

/* hd_mircach.c */

extern int hd_ca_ckcach(
    register struct buf *lb,             /* current logical buf struct */
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct lvol *lv);           /* ptr to lvol structure */

extern void hd_ca_use(
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent,   /* cache entry pointer */
    register int h_t);                   /* head/tail flag */

extern struct ca_mwc_mp *hd_ca_new(
    register struct volgrp *vg);         /* ptr to volgrp structure */

extern void hd_ca_wrt(void);

extern void hd_ca_wend(
    register struct pbuf *pb);           /* Address of pbuf completed */

extern void hd_ca_sked(
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct pvol *pvol);         /* pvol ptr for this PV */

extern struct ca_mwc_mp *hd_ca_fnd(
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct buf *lb);            /* ptr to lbuf to find the entry */
                                         /* for */

extern void hd_ca_clnup(
    register struct volgrp *vg);         /* ptr to volgrp structure */

extern void hd_ca_qunlk(
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent);  /* ptr to entry to unlink */

extern int hd_ca_pvque(
    register struct buf *lb,             /* current logical buf struct */
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct lvol *lv);           /* ptr to lvol structure */

extern void hd_ca_end(
    register struct pbuf *pb);           /* physical device buf struct */

extern void hd_ca_term(
    register struct buf *lb);            /* current logical buf struct */

extern void hd_ca_mvhld(
    register struct volgrp *vg);         /* ptr to volgrp structure */

/* hd_dump.c */

extern int hd_dump(
    dev_t dev,                     /* major/minor of LV */
    struct uio *uiop,              /* ptr to uio struct describing operation */
    int cmd,                       /* dump command */
    char *arg,                     /* cmd dependent - ptr to dmp_query struct */
    int chan,                      /* not used */
    int ext);                      /* not used */

extern int hd_dmpxlate(
    register dev_t dev,            /* major/minor of LV */
    register struct uio *luiop,    /* ptr to logical uio structure */
    register struct volgrp *vg);   /* ptr to VG from device switch table */

/* hd_top.c */

extern int hd_open(
    dev_t dev,            /* device number major,minor of LV to be opened */
    int flags,            /* read/write flag */
    int chan,             /* not used */
    int ext);             /* not used */

extern int hd_allocpbuf(void);

extern void hd_pbufdmpq(
    register struct pbuf *pb,      /* new pbuf for chain */
    register struct pbuf **qq);    /* Ptr to queue anchor */

extern void hd_openbkout(
    int bopoint,          /* point to start backing out */
    struct volgrp *vg);   /* struct volgrp ptr */

extern void hd_backout(
    int bopoint,          /* point where error occurred & need to */
                          /* backout all structures pinned before */
                          /* this point */
    struct lvol *lv,      /* ptr to lvol to backout */
    struct volgrp *vg);   /* struct volgrp ptr */

extern int hd_close(
    dev_t dev,            /* device number major,minor of LV to be closed */
    int chan,             /* not used */
    int ext);             /* not used */

extern void hd_vgcleanup(
    struct volgrp *vg);   /* struct volgrp ptr */

extern void hd_frefrebb(void);

extern int hd_allocbblk(void);

extern int hd_read(
    dev_t dev,            /* num major,minor of LV to be read */
    struct uio *uiop,     /* pointer to uio structure that specifies */
                          /* location & length of caller's data buffer */
    int chan,             /* not used */
    int ext);             /* extension parameters */

extern int hd_write(
    dev_t dev,            /* num major,minor of LV to be written */
    struct uio *uiop,     /* pointer to uio structure that specifies */
                          /* location & length of caller's data buffer */
    int chan,             /* not used */
    int ext);             /* extension parameters */

extern int hd_mincnt(
    struct buf *bp,       /* ptr to pbuf struct to be checked */
    void *minparms);      /* ptr to ext value sent to uphysio by */
                          /* hd_read/hd_write. */

extern int hd_ioctl(
    dev_t dev,            /* device number major,minor of LV to be opened */
    int cmd,              /* specific ioctl command to be performed */
    int arg,              /* addr of parameter blk for the specific cmd */
    int mode,             /* request origination */
    int chan,             /* not used */
    int ext);             /* not used */

extern struct mwc_rec *hd_alloca(void);

extern void hd_dealloca(
    register struct mwc_rec *ca_ptr);   /* ptr to cache to free */

extern void hd_nodumpvg(
    struct volgrp *);

/* hd_phys.c */

extern void hd_begin(
    register struct pbuf *pb,      /* physical device buf struct */
    register struct volgrp *vg);   /* pointer to volgrp struct */

extern void hd_end(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_resume(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_ready(
    register struct pbuf *pb);     /* physical request buf */

extern void hd_start(void);

extern void hd_gettime(
    register struct timestruc_t *o_time);   /* old time */
/* hd_bbrel.c */

extern int hd_chkblk(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_bbend(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_baddone(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_badblk(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_swreloc(
    register struct pbuf *pb);     /* physical request to process */

extern daddr_t hd_assignalt(
    register struct pbuf *pb);     /* physical request to process */

extern struct bad_blk *hd_fndbbrel(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_nqbblk(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_dqbblk(
    register struct pbuf *pb,      /* physical request to process */
    register daddr_t blkno);

/* hd_sched.c */

extern void hd_schedule(void);

extern int hd_avoid(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* VG volgrp ptr */

extern void hd_resyncpp(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_freshpp(
    register struct volgrp *vg,    /* pointer to volgrp struct */
    register struct pbuf *pb);     /* physical request buf */

extern void hd_mirread(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_fixup(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_stalepp(
    register struct volgrp *vg,    /* pointer to volgrp struct */
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_staleppe(
    register struct pbuf *pb);     /* physical request buf */

extern void hd_xlate(
    register struct pbuf *pb,      /* physical request buf */
    register int mirror,           /* mirror number */
    register struct volgrp *vg);   /* VG volgrp ptr */

extern int hd_regular(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* volume group structure */

extern void hd_finished(
    register struct pbuf *pb);     /* physical device buf struct */

extern int hd_sequential(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* volume group structure */

extern void hd_seqnext(
    register struct pbuf *pb,      /* physical request buf */
    register struct volgrp *vg);   /* VG volgrp pointer */

extern void hd_seqwrite(
    register struct pbuf *pb);     /* physical device buf struct */

extern int hd_parallel(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* volume group structure */

extern void hd_freeall(
    register struct pbuf *q);      /* write request queue */

extern void hd_append(
    register struct pbuf *pb,      /* physical request pbuf */
    register struct pbuf **qq);    /* Ptr to write request queue anchor */

extern void hd_nearby(
    register struct pbuf *pb,      /* physical request pbuf */
    register struct buf *lb,       /* logical request buf */
    register int mask,             /* mirrors to avoid */
    register struct volgrp *vg,    /* volume group structure */
    register struct lvol *lv);

extern void hd_parwrite(
    register struct pbuf *pb);     /* physical device buf struct */

/* hd_strat.c */

extern void hd_strategy(
    register struct buf *lb);          /* input list of logical buf structs */

extern void hd_initiate(
    register struct buf *lb);          /* input list of logical bufs */

extern struct buf *hd_reject(
    struct buf *lb,                    /* offending buf structure */
    int errno);                        /* error number */

extern void hd_quiescevg(
    struct volgrp *vg);                /* pointer from device switch table */

extern void hd_quiet(
    dev_t dev,                         /* number major,minor of LV to quiesce */
    struct volgrp *vg);                /* ptr from device switch table */

extern void hd_redquiet(
    dev_t dev,                         /* number major,minor of LV */
    struct hd_lvred *red_lst);         /* ptr to list of PPs to remove */

extern int hd_add2pool(
    register struct pbuf *subpool,     /* ptr to pbuf sub pool */
    register struct pbuf *dmpq);       /* ptr to pbuf dump queue */

extern void hd_deallocpbuf(void);

extern int hd_numpbufs(void);

extern void hd_terminate(
    register struct buf *lb);          /* logical buf struct */

extern void hd_unblock(
    register struct buf *next,         /* first request on hash chain */
    register struct buf *lb);          /* logical request to reschedule */

extern void hd_quelb(
    register struct buf *lb,           /* current logical buf struct */
    register struct hd_queue *que);    /* queue structure ptr */

extern int hd_kdis_initmwc(
    struct volgrp *vg);                /* volume group pointer */

extern int hd_kdis_dswadd(
    register dev_t device,             /* device number of the VG */
    register struct devsw *devsw);     /* address of the devsw entry */

extern int hd_kdis_chgqrm(
    struct volgrp *vg,                 /* volume group pointer */
    short *newqrm);                    /* new quorum count */

extern int hd_kproc(void);

/* hd_vgsa.c */

extern int hd_sa_strt(
    register struct pbuf *pb,          /* physical device buf struct */
    register struct volgrp *vg,        /* volgrp pointer */
    register int type);                /* type of request */

extern void hd_sa_wrt(
    register struct volgrp *vg);       /* volgrp pointer */

extern void hd_sa_iodone(
    register struct buf *lb);          /* ptr to lbuf in VG just completed */

extern void hd_sa_cont(
    register struct volgrp *vg,        /* volgrp pointer */
    register int sa_updated);          /* ptr to lbuf in VG just completed */

extern void hd_sa_hback(
    register struct pbuf *head_ptr,    /* head of pbuf list */
    register struct pbuf *new_pbuf);   /* ptr to pbuf to append to list */

extern void hd_sa_rtn(
    register struct pbuf *head_ptr,    /* head of pbuf list */
    register int err_flg);             /* if true return requests with */
                                       /* ENXIO error */

extern int hd_sa_whladv(
    register struct volgrp *vg,        /* volgrp pointer */
    register int c_whl_idx);           /* current wheel index */

extern void hd_sa_update(
    register struct volgrp *vg);       /* volgrp pointer */

extern int hd_sa_qrmchk(
    register struct volgrp *vg);       /* volgrp pointer */

extern int hd_sa_config(
    register struct volgrp *vg,        /* volgrp pointer */
    register int type,                 /* type of hd_config request */
    register caddr_t arg);             /* ptr to arguments for the request */

extern int hd_sa_onerev(
    register struct volgrp *vg,        /* volgrp pointer */
    register struct pbuf *pv,          /* ptr pbuf structure */
    register int type);                /* type of hd_config request */

extern void hd_bldpbuf(
    register struct pbuf *pb,          /* ptr to pbuf struct */
    register struct pvol *pvol,        /* target pvol ptr */
    register int type,                 /* type of pbuf to build */
    register caddr_t buffaddr,         /* data buffer address - system */
    register unsigned cnt,             /* length of buffer */
    register struct xmem *xmem,        /* ptr to cross memory descriptor */
    register void (*sched)());         /* ptr to function ret void */

extern int hd_extend(
    struct sa_ext *saext);             /* ptr to structure with extend info */

extern void hd_reduce(
    struct sa_red *sa_red,             /* ptr to structure with reduce info */
    struct volgrp *vg);                /* ptr to volume group structure */

/* hd_bbdir.c */

extern void hd_upd_bbdir(
    register struct pbuf *pb);         /* physical request to process */

extern void hd_bbdirend(
    register struct pbuf *vgpb);       /* ptr to VG bb_pbuf */

extern void hd_bbdirop(void);

extern int hd_bbadd(
    register struct pbuf *vgpb);       /* ptr to VG bb_pbuf */

extern int hd_bbdel(
    register struct pbuf *vgpb);       /* ptr to VG bb_pbuf */

extern int hd_bbupd(
    register struct pbuf *vgpb);       /* ptr to VG bb_pbuf */

extern void hd_chkbbhld(void);

extern void hd_bbdirdone(
    register struct pbuf *origpb);     /* physical request to process */

extern void hd_logerr(
    register unsigned id,              /* original request to process */
    register ulong dev,                /* device number */
    register ulong arg1,
    register ulong arg2);

#else

/* See above for description of call arguments */

/* hd_mircach.c */

extern int hd_ca_ckcach();
extern void hd_ca_use();
extern struct ca_mwc_mp *hd_ca_new();
extern void hd_ca_wrt();
extern void hd_ca_wend();
extern void hd_ca_sked();
extern struct ca_mwc_mp *hd_ca_fnd();
extern void hd_ca_clnup();
extern void hd_ca_qunlk();
extern int hd_ca_pvque();
extern void hd_ca_end();
extern void hd_ca_term();
extern void hd_ca_mvhld();

/* hd_dump.c */

extern int hd_dump();
extern int hd_dmpxlate();

/* hd_top.c */

extern int hd_open();
extern int hd_allocpbuf();
extern void hd_pbufdmpq();
extern void hd_openbkout();
extern void hd_backout();
extern int hd_close();
extern int hd_vgcleanup();
extern void hd_frefrebb();
extern int hd_allocpbblk();
extern int hd_read();
extern int hd_write();
extern int hd_mincnt();
extern int hd_ioctl();
extern struct mwc_rec *hd_alloca();
extern void hd_dealloca();
extern void hd_nodumpvg();

/* hd_phys.c */

extern void hd_begin();
extern void hd_end();
extern void hd_resume();
extern void hd_ready();
extern void hd_start();
extern void hd_gettime();

/* hd_bbrel.c */

extern int hd_chkblk();
extern void hd_bbend();
extern void hd_baddone();
extern void hd_badblk();
extern void hd_swreloc();
extern daddr_t hd_assignalt();
extern struct bad_blk *hd_fndbbrel();
extern void hd_nqbblk();
extern void hd_dqbblk();

/* hd_sched.c */

extern void hd_schedule();
extern int hd_avoid();
extern void hd_resyncpp();
extern void hd_freshpp();
extern void hd_mirread();
extern void hd_fixup();
extern void hd_stalepp();
extern void hd_staleppe();
extern void hd_xlate();
extern int hd_regular();
extern void hd_finished();
extern int hd_sequential();
extern int hd_seqnext();
extern void hd_seqwrite();
extern int hd_parallel();
extern void hd_freeall();
extern void hd_append();
extern void hd_nearby();
extern void hd_parwrite();

/* hd_strat.c */

extern void hd_strategy();
extern void hd_initiate();
extern struct buf *hd_reject();
extern void hd_quiescevg();
extern void hd_quiet();
extern void hd_redquiet();
extern int hd_add2pool();
extern void hd_deallocpbuf();
extern int hd_numpbufs();
extern void hd_terminate();
extern void hd_unblock();
extern void hd_quelb();
extern void hd_kdis_dswadd();
extern void hd_kdis_initmwc();
extern int hd_kdis_chgqrm();
extern int hd_kproc();

/* hd_vgsa.c */

extern int hd_sa_strt();
extern void hd_sa_wrt();
extern void hd_sa_iodone();
extern void hd_sa_cont();
extern void hd_sa_hback();
extern void hd_sa_rtn();
extern int hd_sa_whladv();
extern void hd_sa_update();
extern int hd_sa_qrmchk();
extern int hd_sa_config();
extern void hd_bldpbuf();
extern int hd_extend();
extern void hd_reduce();
extern void hd_sa_onerev();



/* hd_bbdir.c */

extern void hd_upd_bbdir();
extern void hd_bbdirend();
extern void hd_bbdirop();
extern int hd_bbadd();
extern int hd_bbdel();
extern int hd_bbupd();
extern void hd_chk_bbhld();
extern void hd_bbdirdone();
extern void hd_logerr();

#endif /* _NO_PROTO */
#endif /* _KERNEL */

#endif /* _H_HD */
Subject: LVM code 

static char sccsid[] = "@(#)hd_vgsa.c 1.4 com/sysx/lvm,3.1.1 10/11/90 18:59:17";



/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - hd_vgsa.c
 *
 * FUNCTIONS: hd_sa_strt, hd_sa_wrt, hd_sa_iodone, hd_sa_cont, hd_sa_hback,
 *            hd_sa_rtn, hd_sa_whladv, hd_sa_update, hd_sa_qrmchk,
 *            hd_sa_config, hd_bldpbuf, hd_sa_onerev, hd_reduce, hd_extend,
 *
 * © COPYRIGHT International Business Machines Corp. 1989, 1990
 * All Rights Reserved
 */

/*
 * hd_vgsa.c -- LVM device driver Volume Group Status Area support routines
 *
 * These routines handle the Volume Group Status Area (VGSA) used
 * to maintain the state of physical partitions that are copies of each
 * other. The VGSA also indicates whether a physical volume is missing.
 *
 * Function:
 *
 * Execution environment:
 *
 * All these routines run on interrupt levels, so they are not
 * permitted to page fault. They run within critical sections
 * that are serialized with block I/O offlevel iodone() processing.
 */



#include <sys/types.h>
#include <sys/errno.h>
#include <sys/intr.h>
#include <sys/malloc.h>
#include <sys/sleep.h>
#include <sys/hd_psn.h>
#include <sys/dasd.h>
#include <sys/vgsa.h>
#include <sys/hd_config.h>
#include <sys/trchkid.h>
#include <sys/hd.h>

/*
 * NAME: hd_sa_strt
 *
 * FUNCTION: Process a new SA request. Put the request on the hold list
 *           (sa_hld_lst). If the wheel is not rolling, start it.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: SUCCESS or FAILURE
 */
int
hd_sa_strt(
register struct pbuf *pb,       /* physical device buf struct */
register struct volgrp *vg,     /* volgrp pointer */
register int type)              /* type of request */
{
    register struct pbuf *hlst; /* temporary sa_hld_lst ptr */
    register int rc;            /* general function return code */

    /*
     * If the VG is closing don't start anything
     */
    if( vg->flags & VG_FORCEDOFF )
        return( FAILURE );

    /*
     * If "pb" is NULL then this is a restart from the config routines.
     * The config routines got control of the WHEEL but then found they
     * did not change anything so they just want to restart it.
     */
    if( pb ) {
        /*
         * Save the type of the request and hang it on the hold list
         */
        pb->pb_type = type;
        pb->pb.av_forw = NULL;
        if( vg->sa_hld_lst ) {
            /*
             * Find end of list
             */
            hlst = vg->sa_hld_lst;
            while( hlst->pb.av_forw )
                hlst = (struct pbuf *)(hlst->pb.av_forw);
            hlst->pb.av_forw = (struct buf *)pb;
        }
        else
            vg->sa_hld_lst = pb;
    }

    /*
     * Start the wheel if not rolling already
     */
    if( !(vg->flags & (SA_WHL_ACT | SA_WHL_HLD)) ) {
        vg->flags |= SA_WHL_ACT;

        /*
         * Generate a cross memory descriptor - see hd_sa_wrt( )
         * for reason why it is done here.
         */
        vg->sa_lbuf.b_xmemd.aspace_id = XMEM_INVAL;
        rc = xmattach( vg->vgsa_ptr, sizeof( struct vgsa_area ),
                       &(vg->sa_lbuf.b_xmemd), SYS_ADSPACE );
        ASSERT( rc == XMEM_SUCC );

        hd_sa_cont( vg, 0 );
    }

    return( SUCCESS );
}

/*
 * NAME: hd_sa_wrt
 *
 * FUNCTION: Build a buf structure to do logical IO to write the next
 *           SA on the wheel.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_wrt(
register struct volgrp *vg)     /* volgrp pointer */
{
    register struct buf *lb;    /* ptr to lbuf in volgrp struct */
    register int widx;          /* VG wheel index */
    register int rc;            /* function return code */
    struct xmem xmemd;          /* area to save the xmem descriptor */

    widx = vg->wheel_idx;

    /*
     * Save the cross memory descriptor then zero the buf structure
     * then stuff it with the necessary fields.
     * Saving the cross memory descriptor is faster than attaching/
     * detaching on each PV write. This way we can attach when
     * the wheel is started and not detach until it stops.
     */
    lb = &(vg->sa_lbuf);
    xmemd = lb->b_xmemd;
    bzero( lb, sizeof(struct buf) );

    lb->b_flags = B_BUSY;
    lb->b_iodone = hd_sa_iodone;
    lb->b_dev = makedev( vg->major_num, 0 );
    lb->b_blkno = GETSA_LSN( vg, widx );
    lb->b_baddr = (caddr_t)(vg->vgsa_ptr);
    lb->b_bcount = sizeof( struct vgsa_area );
    lb->b_options = REQ_VGSA;
    lb->b_event = EVENT_NULL;

    /* restore the cross memory descriptor */
    lb->b_xmemd = xmemd;

    /*
     * Save the wheel sequence number that is being written to this
     * VGSA
     */
    SETSA_SEQ( vg, widx, vg->whl_seq_num );

    /*
     * Call hd_regular( ) to translate the logical request then hd_start( )
     * to issue it to the disk drivers.
     * NOTE: hd_regular( ) will use the embedded pbuf in the volgrp
     *       structure, therefore it will never fail due to no
     *       pbufs available. This also means that LV0 does not
     *       have to be open!
     */
    hd_regular( lb, vg );
    hd_start();

    return;
}

/*
 * NAME: hd_sa_iodone
 *
 * FUNCTION: Return point for end of VGSA write operation.
 *
 * NOTES: Process any error on the write. This means marking the
 *        PV as missing. Then call hd_sa_cont( ) to start the next
 *        SA write if more to do.
 *
 *        If a PV is marked as missing there is no pbuf needed to
 *        remember when this happened. BECAUSE, there is no
 *        specific request waiting on any one particular SA write
 *        request. THEREFORE, the only thing that must be done
 *        is to ensure the wheel keeps rolling for at least one more
 *        revolution from this point. This is done by bumping the
 *        whl_seq_num variable.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_iodone(
register struct buf *lb)        /* ptr to lbuf in VG just completed */
{
    register int sa_updated = 0; /* nonzero indicates SA updated */
    struct volgrp *vg;           /* VG volgrp ptr from devsw table */

    /* get the volgrp ptr from device switch table */
    (void) devswqry( lb->b_dev, NULL, &vg );

    lb->b_flags &= ~B_BUSY;

    /*
     * If error on write mark the PV missing
     */
    if( lb->b_flags & B_ERROR ) {
        /*
         * Change pvstate to missing. Set pvmissing flag in VGSA. Check
         * for quorum. Change VGSA timestamp and sequence number.
         * Log an error message concerning the missing PV.
         */
        register struct pvol *pvol; /* ptr to pvol of missing pv */

        pvol = vg->sa_pbuf.pb_pvol;
        pvol->pvstate = PV_MISSING;
        SETSA_PVMISS( vg->vgsa_ptr, pvol->pvnum );
        (void) hd_sa_qrmchk( vg );
        sa_updated = 1;
        hd_sa_update( vg );

        /*
         * error message(????);
         */
    }

    /*
     * Continue to next VGSA write
     */
    hd_sa_cont( vg, sa_updated );
    return;
}

/*
 * NAME: hd_sa_cont
 *
 * FUNCTION: Continue writing VGSA areas
 *
 * NOTES: This function is used to start the wheel or keep it
 *        rolling. The only thing that stops the wheel once
 *        it is rolling is the whl_seq_num variables. When the
 *        last write sa_seq_num matches the next one we are
 *        complete.
 *
 *        If the VG is closing due to a loss of quorum then all
 *        active requests are returned with errors. This will
 *        result in an error being returned with the original
 *        request. Because of the loss of quorum we can not
 *        guarantee the VGSA was updated with the correct information.
 *        Any user data will be recovered by the MWC cache.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_cont(
register struct volgrp *vg,     /* volgrp pointer */
register int sa_updated)        /* nonzero if the in-memory SA was already updated */
{
    register struct pbuf *hld_req;      /* ptr to request being moved to */
                                        /* active list */
    register struct pbuf *alst;         /* temp sa_act_lst ptr */
    register struct buf **alst_forw;    /* address of sa_act_lst av_forw ptr */
    register struct pbuf *new_req = NULL; /* ptr to the first new request */
                                        /* that was put on the active list */
    register struct buf *alst_lb;       /* ptr to lbuf for active list pbuf */
    register struct buf *hld_lb;        /* ptr to lbuf for hold request pbuf */
    register int n_whl_idx;             /* new wheel index */
    register int i;                     /* general counter */

    /*
     * Put the wheel on hold if some config process wants control of it and
     * that process is not waiting for the wheel to stop. Then
     * wake that process up. Said process will restart the wheel when
     * it is finished making its changes.
     * *NOTE* It is assumed the process has everything it needs in memory
     *        and it is all pinned.
     */
    if( (vg->config_wait != EVENT_NULL) && !(vg->flags & SA_WHL_WAIT) ) {
        vg->flags |= SA_WHL_HLD;
        vg->flags &= ~SA_WHL_ACT;
        xmdetach( &(vg->sa_lbuf.b_xmemd) );
        e_wakeup( &(vg->config_wait) );
        return;
    }

    /*
     * Move any requests currently on the hold list to the active list
     */

    while( vg->sa_hld_lst ) {

        /* Get pbuf at head of list */
        hld_req = vg->sa_hld_lst;
        vg->sa_hld_lst = (struct pbuf *)(hld_req->pb.av_forw);
        hld_req->pb.av_forw = NULL;
        hld_req->pb.av_back = NULL;

        /*
         * Scan active list for any request that is doing the same
         * type of request on the same PPs/PVs. If one is found
         * then hang this request on the av_back list. Thus, this
         * request will be allowed to continue when the head of the
         * av_back list is allowed.
         */
        alst = vg->sa_act_lst;
        alst_forw = (struct buf **)(&(vg->sa_act_lst));

        /*
         * Scan the active list until the end or we find a match
         */
        while( hld_req && alst ) {
            if( alst->pb_type != hld_req->pb_type ) {
                alst_forw = (struct buf **)(&(alst->pb.av_forw));
                alst = (struct pbuf *)(alst->pb.av_forw);
                continue;
            }

            switch( alst->pb_type ) {

            case SA_PVMISSING:
            case SA_PVREMOVED:
                /*
                 * Check the pvol addresses in the pbufs
                 */
                if( alst->pb_pvol == hld_req->pb_pvol ) {
                    /*
                     * We have a match. Hang the new request
                     * on the av_back list.
                     */
                    hd_sa_hback( alst, hld_req );
                    hld_req = NULL;
                }
                break;

            case SA_STALEPP:
                /*
                 * Check that the device numbers (b_dev) are the same
                 * in the corresponding lbufs. Then that the LPs
                 * are the same. And finally the actual mirrors.
                 */
                alst_lb = alst->pb_lbuf;
                hld_lb = hld_req->pb_lbuf;
                if( (alst_lb->b_dev == hld_lb->b_dev) &&
                    (BLK2PART(vg->partshift, alst_lb->b_blkno) ==
                     BLK2PART(vg->partshift, hld_lb->b_blkno)) ) {
                    /*
                     * Check mirrors - if a mirror is stale on the
                     * active list pbuf but not in the new request
                     * pbuf count it as a match. If the bits are
                     * reversed the new request must be put on the
                     * active list (av_forw) since it must wait
                     * for the PP to be marked as stale.
                     */
                    for( i=0; i < MAXNUMPARTS; i++ ) {
                        if( (alst->pb_mirbad & (1 << i)) ^
                            (hld_req->pb_mirbad & (1 << i)) ) {
                            if( !(alst->pb_mirbad & (1 << i)) ) {
                                break;
                            }
                        }
                    }
                    if( i == MAXNUMPARTS ) {
                        /*
                         * We have a match. Hang the new request
                         * on the av_back list.
                         */
                        hd_sa_hback( alst, hld_req );
                        hld_req = NULL;
                    }
                }
                break;

            case SA_FRESHPP:
            case SA_CONFIGOP:
                /*
                 * Since there can only be one resync operation per
                 * LP all fresh PP operations must be unique.
                 * Therefore, we can go directly to the end of
                 * the active list.
                 *
                 * The same thing holds true for config operations.
                 * There can only be one active in the VG at a time.
                 */
                break;

            default:
                panic("hd_sa_cont: unknown pbuf type");
            } /* END switch on pb type */

            /*
             * If the new request pointer is NULL then the request was
             * put on the av_back list and we can carry on. Otherwise,
             * we must look further down the av_forw list.
             */
            if( hld_req ) {
                alst_forw = (struct buf **)(&(alst->pb.av_forw));
                alst = (struct pbuf *)(alst->pb.av_forw);
            }
        } /* END while( hld_req && alst ) */

        /*
         * If alst is NULL we are at the end of the active list.
         * Put the new request on the list and modify the VGSA as per
         * the type of request.
         */
        if( !alst ) {
            *alst_forw = (struct buf *)hld_req;

            /*
             * If the timestamp on the memory version of the VGSA has
             * not been bumped do it now. Then remember the address of
             * this first pbuf to be added to active list this pass.
             */
            if( !sa_updated ) {
                sa_updated = 1;
                hd_sa_update( vg );
            }
            if( !new_req )
                new_req = hld_req;

            switch( hld_req->pb_type ) {

                register struct lvol *lv;   /* ptr to lvol structure */
                register struct part *part; /* ptr to PP part structure */
                register ulong lp;          /* request LP number */
                register ulong pp;          /* mirror PP number */
                register int mirrors;       /* mirror mask for action */
                register int i;             /* general */

            case SA_PVMISSING:
            case SA_PVREMOVED:
                /*
                 * Change pvstate to missing. Set pvmissing flag in
                 * VGSA. (If removed PV, update the VG's quorum count
                 * before it is rechecked.) Check the quorum.
                 * Log an error message concerning the missing/removed PV.
                 */
                hld_req->pb_pvol->pvstate = PV_MISSING;
                SETSA_PVMISS( vg->vgsa_ptr, hld_req->pb_pvol->pvnum );
                if( hld_req->pb_type == SA_PVREMOVED )
                    vg->quorum_cnt = hld_req->pb.b_work;
                (void) hd_sa_qrmchk( vg );

                /*
                 * error message(????)
                 */
                break;

            case SA_STALEPP:
            case SA_FRESHPP:
                /*
                 * For SA_STALEPP the pb_mirbad field in the pbuf
                 * indicates which mirrors should be marked as
                 * stale. For SA_FRESHPP the pb_mirdone field in
                 * the pbuf indicates which mirrors should be made
                 * fresh (active).
                 * Find the LV lvol structure and LP number of the
                 * logical request.
                 */
                if( hld_req->pb_type == SA_STALEPP )
                    mirrors = hld_req->pb_mirbad;
                else
                    mirrors = hld_req->pb_mirdone;
                hld_lb = hld_req->pb_lbuf;
                lv = VG_DEV2LV( vg, hld_lb->b_dev );
                lp = BLK2PART( vg->partshift, hld_lb->b_blkno );

                /*
                 * Now scan the mirrors bits and for each one that
                 * is set log an error message concerning the
                 * operation then set/reset corresponding
                 * bit in the in memory version of the VGSA.
                 */
                while( mirrors ) {
                    i = FIRST_MASK( mirrors );
                    mirrors &= (~(MIRROR_MASK(i)));
                    part = PARTITION( lv, lp, i );
                    pp = BLK2PART( vg->partshift,
                                   part->start - part->pvol->fst_usr_blk );
                    if( hld_req->pb_type == SA_STALEPP ) {
                        /*
                         * error message(????) STALE pp + 1
                         */
                        SETSA_STLPP( vg->vgsa_ptr, part->pvol->pvnum, pp );
                    }
                    else {
                        /*
                         * error message(????) FRESH pp + 1
                         */
                        CLRSA_STLPP( vg->vgsa_ptr, part->pvol->pvnum, pp );
                    }
                }
                break;

            case SA_CONFIGOP:
                /*
                 * No action needed on a hd_config routine request.
                 * The in memory version was modified when the wheel
                 * was put on hold and control passed to the config
                 * routines.
                 */
                break;

            default:
                panic("hd_sa_cont: unknown pbuf type");

            } /* END of switch on pb_type */
        } /* END of if( !alst ) */
    } /* END while( sa_hld_lst ) */

    /*
     * At this point everything is on the active list and the appropriate
     * action taken. If we have lost a quorum due to said action then
     * return all requests on the active lists with errors (ENXIO) if
     * they do not currently have an error indicated. Before getting out
     * clear the active and hold flags and detach the VGSA memory area.
     */
    if( vg->flags & VG_FORCEDOFF ) {
        while( vg->sa_act_lst ) {
            /* Get pbuf at head of list */
            alst = vg->sa_act_lst;
            vg->sa_act_lst = (struct pbuf *)(alst->pb.av_forw);
            hd_sa_rtn( alst, RTN_ERR );
        }
        vg->flags &= (~(SA_WHL_ACT | SA_WHL_HLD));
        xmdetach( &(vg->sa_lbuf.b_xmemd) );

        /*
         * If the wait flag is on then a config function is waiting for
         * the wheel to stop. So, inform that function that it has. This
         * is used so the varyoffvg function will wait, if the wheel is
         * rolling, before removing the data structures.
         */
        if( vg->flags & SA_WHL_WAIT ) {
            vg->flags &= ~SA_WHL_WAIT;
            e_wakeup( &(vg->config_wait) );
        }
        return;
    } /* END if( VG closing ) */

    /*
     * Now see if any request should get off of the wheel. This algorithm
     * assumes that a VGSA can not be removed from the wheel if anyone is
     * using it as a stopping point.
     */
    n_whl_idx = hd_sa_whladv( vg, vg->wheel_idx );
    while( (vg->sa_act_lst) && (new_req != vg->sa_act_lst) ) {
        if( vg->sa_act_lst->pb_whl_stop == n_whl_idx ) {
            /*
             * Time for this request to get off the wheel and continue
             */
            alst = vg->sa_act_lst;
            vg->sa_act_lst = (struct pbuf *)(alst->pb.av_forw);
            hd_sa_rtn( alst, RTN_NORM );
        }
        else if( (vg->pvols[n_whl_idx >> 1]->pvstate == PV_MISSING) ||
                 (NUKESA( vg, n_whl_idx ) == TRUE) ) {
            /*
             * If the next wheel index is on a missing PV, i.e. an
             * inactive VGSA, advance to the next wheel index and see
             * if any request should get off at it. Also, if the SA
             * is to be removed (nuked) then do it now.
             * *NOTE* We should never get here if we lose a quorum. As
             *        a safety measure the assert is in place to prevent
             *        an infinite loop. If we go completely around the
             *        wheel without finding an active VGSA we have a
             *        problem somewhere.
             */
            if( NUKESA( vg, n_whl_idx ) == TRUE ) {
                SETSA_LSN( vg, n_whl_idx, 0 );
                SET_NUKESA( vg, n_whl_idx, FALSE );
            }
            n_whl_idx = hd_sa_whladv( vg, n_whl_idx );
            assert( n_whl_idx != vg->wheel_idx );
        }
        else
            break;
    }

    /*
     * We got out of the last loop under 1 of 3 conditions
     *   1. The active list was empty.
     *   2. The head of active list points to a request that was just added.
     *   3. The head of active list has a stopping point further around
     *      the wheel and we are at the next active VGSA to write.
     * At this point we must make sure the next wheel index (n_whl_idx) is
     * pointing at an active VGSA, i.e. we came out because of condition
     * 1 or 2. If the VGSA is inactive advance to the next active one
     * and then set the pb_whl_stop fields of any new requests. Thus, no
     * request gets put on the wheel at an inactive VGSA.
     */
    while( (vg->pvols[n_whl_idx >> 1]->pvstate == PV_MISSING) ||
           (NUKESA( vg, n_whl_idx ) == TRUE) ) {
        /*
         * *NOTE* We should never get here if we lose a quorum. As
         *        a safety measure the assert is in place to prevent
         *        an infinite loop. If we go completely around the
         *        wheel without finding an active VGSA we have a
         *        problem somewhere.
         */
        if( NUKESA( vg, n_whl_idx ) == TRUE ) {
            SETSA_LSN( vg, n_whl_idx, 0 );
            SET_NUKESA( vg, n_whl_idx, FALSE );
        }
        n_whl_idx = hd_sa_whladv( vg, n_whl_idx );
        assert( n_whl_idx != vg->wheel_idx );
    }

    while( new_req ) {
        new_req->pb_whl_stop = n_whl_idx;
        new_req = (struct pbuf *)(new_req->pb.av_forw);
    }

    /* Save the next wheel index */
    vg->wheel_idx = n_whl_idx;

    /*
     * Check to see if the current VGSA sequence number has been written
     * to the next VGSA. If it has not then write it. If it matches
     * then we have written the latest SA to all available VGSAs so
     * stop the wheel.
     */
    if( vg->whl_seq_num != GETSA_SEQ( vg, n_whl_idx ) )
        hd_sa_wrt( vg );
    else {
        vg->flags &= ~SA_WHL_ACT;
        xmdetach( &(vg->sa_lbuf.b_xmemd) );

        /*
         * If the wait flag is on then a config function is waiting for
         * the wheel to stop. So, inform that function that it has. This
         * is used so the varyoffvg function will wait, if the wheel is
         * rolling, before removing the data structures.
         */
        if( vg->flags & SA_WHL_WAIT ) {
            vg->flags &= ~SA_WHL_WAIT;
            e_wakeup( &(vg->config_wait) );
        }

        /*
         * Just in case anything was unblocked or the cache hold queue was
         * moved to the pending_Q
         */
        hd_schedule( );
        return;
    }
}
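The stopping rule described in the hd_sa_cont notes (keep writing serially until the next VGSA already carries the current sequence number) can be illustrated in isolation. The following is a minimal, self-contained sketch, not the driver code: `toy_wheel`, `toy_update`, `toy_roll` and `TOY_SLOTS` are hypothetical names, the disk write is reduced to copying a sequence number, and every slot is assumed to be active.

```c
#include <assert.h>

#define TOY_SLOTS 4                 /* hypothetical: number of VGSA copies on the wheel */

struct toy_wheel {
    int whl_seq_num;                /* sequence number of the in-memory VGSA */
    int sa_seq_num[TOY_SLOTS];      /* sequence number last written to each copy */
    int wheel_idx;                  /* slot most recently written */
};

/* Record a status change: bump the in-memory sequence number. */
static void toy_update(struct toy_wheel *w)
{
    w->whl_seq_num++;
}

/*
 * Roll the wheel: write one copy at a time, advancing serially, and stop
 * as soon as the next copy already carries the current sequence number.
 * Returns the number of writes issued.
 */
static int toy_roll(struct toy_wheel *w)
{
    int writes = 0;

    for (;;) {
        int next = (w->wheel_idx + 1) % TOY_SLOTS;

        if (w->sa_seq_num[next] == w->whl_seq_num)
            break;                              /* next copy is current: stop the wheel */
        w->sa_seq_num[next] = w->whl_seq_num;   /* stands in for the disk write */
        w->wheel_idx = next;
        writes++;
        assert(writes <= TOY_SLOTS);            /* one revolution suffices in this toy */
    }
    return writes;
}
```

In this simplified model, one call to `toy_update` followed by `toy_roll` writes every copy exactly once, and a second `toy_roll` with no intervening update issues no writes. Because only one copy is written at a time, an interrupted revolution leaves the remaining copies at the previous, still consistent, sequence number.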

/*
 * NAME: hd_sa_hback
 *
 * FUNCTION: Hang a pbuf on the end of the given av_back list
 *
 * NOTES: This function is used to find the end of the given pbuf
 *        list via the av_back pointer. Then, link the new pbuf
 *        on to the list there. Assumes the av_back pointer in the
 *        new pbuf is NULL.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_hback(
register struct pbuf *head_ptr, /* head of pbuf list */
register struct pbuf *new_pbuf) /* ptr to pbuf to append to list */
{
    while( head_ptr->pb.av_back )
        head_ptr = (struct pbuf *)(head_ptr->pb.av_back);

    head_ptr->pb.av_back = (struct buf *)new_pbuf;

    return;
}
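The SA_STALEPP case in hd_sa_cont decides whether a new request may be hung on an active request's av_back list by comparing the two stale-mirror masks bit by bit. A minimal sketch of that test follows, assuming a hypothetical limit of three mirrors (`TOY_MAXNUMPARTS`) and hypothetical function names; it also shows that the loop amounts to a subset test on the masks.

```c
#include <assert.h>

#define TOY_MAXNUMPARTS 3   /* hypothetical: up to three mirror copies per LP */

/*
 * Return 1 when the new request can piggyback on the active one, i.e.
 * every mirror the new request wants marked stale is already being
 * marked stale by the active request. Follows the bit-by-bit loop.
 */
static int toy_can_piggyback(unsigned active_mirbad, unsigned new_mirbad)
{
    int i;

    for (i = 0; i < TOY_MAXNUMPARTS; i++) {
        unsigned bit = 1u << i;

        if ((active_mirbad & bit) ^ (new_mirbad & bit)) {
            if (!(active_mirbad & bit))
                break;      /* new request adds a mirror: no match */
        }
    }
    return i == TOY_MAXNUMPARTS;
}

/* Equivalent closed form: the new mask must be a subset of the active mask. */
static int toy_is_subset(unsigned active_mirbad, unsigned new_mirbad)
{
    unsigned all = (1u << TOY_MAXNUMPARTS) - 1u;

    return ((new_mirbad & ~active_mirbad) & all) == 0;
}
```

For example, an active request marking mirrors 0 and 1 stale (mask 0x3) can carry a new request for mirror 0 alone (mask 0x1), but not the other way round, since the second request would have to wait for an additional PP to be marked stale.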

/*
 * NAME: hd_sa_rtn
 *
 * FUNCTION: Return the given av_back list of requests to their
 *           respective callers.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_rtn(
register struct pbuf *head_ptr, /* head of pbuf list */
register int err_flg)           /* if true return requests with */
                                /* ENXIO error */
{
    register struct pbuf *lst_ptr; /* anchor for av_back list */

    while( head_ptr ) {
        /*
         * piggybacked requests are on the av_back chain
         */
        lst_ptr = (struct pbuf *)(head_ptr->pb.av_back);

        /*
         * if the request should be returned with an error but the
         * B_ERROR flag is off TURN IT ON. Dummy up address so it
         * looks like none of the request worked
         */
        if( (err_flg == RTN_ERR) && (!(head_ptr->pb.b_flags & B_ERROR)) ) {
            head_ptr->pb.b_flags |= B_ERROR;
            head_ptr->pb.b_error = EIO;
            head_ptr->pb_addr = head_ptr->pb_lbuf->b_baddr;
        }

        /* Set the B_DONE flag to indicate the request is done */
        head_ptr->pb.b_flags |= B_DONE;

        /*
         * return the request via wakeup or function call.
         * it is possible for b_event to still be EVENT_NULL because of
         * some error and pb_sched to be NULL. If this condition exists
         * just drop the request and the caller will see it is complete
         * by checking the B_DONE flag.
         */
        if( head_ptr->pb.b_event != EVENT_NULL )
            e_wakeup( &(head_ptr->pb.b_event) );
        else if( head_ptr->pb_sched )
            HD_SCHED( head_ptr );

        /*
         * get the next one off of the list
         */
        head_ptr = lst_ptr;
    } /* END while( head_ptr ) */
    return;
}

/*
 * NAME: hd_sa_whladv
 *
 * FUNCTION: Advance wheel index to next VGSA
 *
 * NOTES: The wheel index has 2 components. A primary/secondary
 *        bit, the low order bit of the index. This controls which
 *        VGSA is being indexed on any particular PV. The second
 *        component is the PV index. It is the remaining bits of
 *        index. It is used as the index into the pvols array in
 *        the volgrp structure. This mechanism assumes that the
 *        maximum number of PVs in a VG is a power of 2.
 *
 *        If MAXPVS is a power of 2 this function will be much
 *        more efficient.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: next VGSA on the wheel
 */
int
hd_sa_whladv(
register struct volgrp *vg,     /* volgrp pointer */
register int c_whl_idx)         /* current wheel index */
{
    c_whl_idx++;
    while( 1 ) {
        c_whl_idx %= (MAXPVS * 2);

        /*
         * If no pvol pointer then advance index to next PV.
         * If pvol pointer then look to see if there is a logical sector
         * number associated with the index. If so we have found the
         * next VGSA index. If not bump the index and look again.
         */
        if( !(vg->pvols[c_whl_idx >> 1]) ) {
            /*
             * If the index is odd just bump it by 1 to get to next PV.
             * If it is even bump the index by 2 to get to the next PV.
             */
            if( c_whl_idx & 1 )
                c_whl_idx += 1;
            else
                c_whl_idx += 2;
        }
        else if( GETSA_LSN( vg, c_whl_idx ) )
            break;
        else
            c_whl_idx += 1;
    }

    return( c_whl_idx );
}
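The two-part wheel index described in the hd_sa_whladv notes (low bit selects the primary or secondary VGSA, remaining bits select the PV) can be tried on a small table. This is a sketch under stated assumptions: `toy_pvol`, `toy_whladv` and `TOY_MAXPVS` are hypothetical stand-ins, and the per-copy logical sector number is assumed to live in a two-element array, with 0 meaning "no copy here".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXPVS 4                    /* hypothetical; the driver uses MAXPVS */

struct toy_pvol {
    long sa_lsn[2];                     /* sector of primary/secondary VGSA, 0 = none */
};

/*
 * Advance to the next wheel index that names an existing VGSA copy.
 * Index layout: bit 0 selects primary/secondary, the remaining bits
 * select the PV. The caller must ensure at least one copy exists,
 * otherwise the loop would not terminate.
 */
static int toy_whladv(struct toy_pvol *pvols[TOY_MAXPVS], int idx)
{
    idx++;
    for (;;) {
        idx %= (TOY_MAXPVS * 2);
        if (pvols[idx >> 1] == NULL)
            idx += (idx & 1) ? 1 : 2;   /* no PV in this slot: skip to the next PV */
        else if (pvols[idx >> 1]->sa_lsn[idx & 1] != 0)
            break;                      /* found a VGSA copy */
        else
            idx += 1;                   /* this copy is absent: try the next index */
    }
    return idx;
}
```

With PV 0 holding two copies, PV 2 holding only a primary copy, and slots 1 and 3 empty, the wheel visits indexes 0, 1, 4 and then wraps back to 0.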

FIG. 21-114

EP 0 482 853 A2

/*
 * NAME: hd_sa_update
 *
 * FUNCTION: Update the in memory version of the VGSA timestamps
 *           and sequence number.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_update(
register struct volgrp *vg)     /* volgrp pointer */
{
    hd_gettime( &(vg->vgsa_ptr->b_tmstamp) );
    vg->vgsa_ptr->e_tmstamp = vg->vgsa_ptr->b_tmstamp;

    /* bump sequence number */
    vg->whl_seq_num++;

    return;
}

/*
 * NAME: hd_sa_qrmchk
 *
 * FUNCTION: Check the VG for a quorum of SAs
 *
 * NOTES: Count the number of active VGSAs. If the count
 *        is less than the threshold (quorum_cnt) set the
 *        VG_FORCEDOFF flag so the VG will unwind and shutdown.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: count of active VGSAs
 */
int
hd_sa_qrmchk(
register struct volgrp *vg)     /* volgrp pointer */
{
    register int act_cnt;       /* count of active VGSAs */
    register int idx;           /* PV index */

    /*
     * loop thru the pvols array in the volgrp structure
     */
    for( act_cnt=0, idx=0; idx < MAXPVS; idx++ ) {
        if( (vg->pvols[idx]) && (vg->pvols[idx]->pvstate != PV_MISSING) ) {
            if( vg->pvols[idx]->sa_area[0].lsn )
                act_cnt++;
            if( vg->pvols[idx]->sa_area[1].lsn )
                act_cnt++;
        }
    }

    /*
     * If the VG is already closing there is no need to do this all again
     */
    if( !(vg->flags & VG_FORCEDOFF) && (act_cnt < vg->quorum_cnt) ) {
        vg->flags |= VG_FORCEDOFF;

        /*
         * error message(????) Loss of quorum VG is closing
         */
    }

    return( act_cnt );
}
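The quorum rule in hd_sa_qrmchk counts VGSA copies rather than physical volumes: each non-missing PV contributes one count per copy it holds. The sketch below restates that rule with hypothetical toy types (`toy_pvol`, `toy_volgrp`) and a plain integer in place of the VG_FORCEDOFF flag bit; it is an illustration, not the driver's data layout.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXPVS      4   /* hypothetical; the driver uses MAXPVS */
#define TOY_PV_ACTIVE   0
#define TOY_PV_MISSING  1

struct toy_pvol {
    int  pvstate;           /* TOY_PV_ACTIVE or TOY_PV_MISSING */
    long sa_lsn[2];         /* sector of each VGSA copy, 0 = none */
};

struct toy_volgrp {
    struct toy_pvol *pvols[TOY_MAXPVS];
    int quorum_cnt;         /* minimum number of active VGSA copies */
    int forced_off;         /* stands in for the VG_FORCEDOFF flag */
};

/* Count active VGSA copies and force the VG off if below quorum. */
static int toy_qrmchk(struct toy_volgrp *vg)
{
    int act_cnt = 0;
    int idx;

    for (idx = 0; idx < TOY_MAXPVS; idx++) {
        struct toy_pvol *pv = vg->pvols[idx];

        if (pv != NULL && pv->pvstate != TOY_PV_MISSING) {
            if (pv->sa_lsn[0])
                act_cnt++;
            if (pv->sa_lsn[1])
                act_cnt++;
        }
    }
    if (!vg->forced_off && act_cnt < vg->quorum_cnt)
        vg->forced_off = 1;     /* loss of quorum: VG is closing */
    return act_cnt;
}
```

With one PV holding two copies, another holding one, and a quorum count of 2, losing the two-copy PV drops the count from 3 to 1 and forces the toy volume group off.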



/*
 * NAME: hd_sa_config
 *
 * FUNCTION: Interface for hd_config routines to access the
 *           VGSA wheel.
 *
 * NOTES: Assumes the hd_config routine has the VG lock.
 *        Thus preventing more than one operation at a time.
 *        AND
 *        The arg variable (array) is in memory and PINNED. Since
 *        this routine may be executed during offlevel interrupt
 *        processing it can not page fault or rely on any disk IO.
 *
 *        There are 3 phases to the hd_config routines modifying
 *        the VGSAs
 *          1. Getting control of the wheel if it is rolling.
 *          2. Modifying the in memory VGSA.
 *          3. Restarting the wheel and waiting for one
 *             revolution.
 *
 *        This function takes care of all of these for the caller.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: SUCCESS or FAILURE
 */
int
hd_sa_config(
register struct volgrp *vg,     /* volgrp pointer */
register int type,              /* type of hd_config request */
register caddr_t arg)           /* ptr to arguments for the request */
{
    register struct pbuf *pb;   /* ptr to a pbuf struct to use */
    register struct pvol *pv;   /* ptr to target pvol struct */
    register struct cnfg_pp_state *ppi;
    register struct cnfg_pv_ins *pv_info;
    register struct cnfg_pv_del *pvdel_info;
    register struct cnfg_pv_vgsa *vgsa_info;
    register struct pvol *pvol;
    register int o_prty = -1;   /* saved interrupt priority */
    register int rc;            /* function return code */
    register int i;             /* general counter */
    register struct sa_ext *saext;  /* arg for HD_KEXTEND */
    register struct part *oldpp;    /* old part structs */
    register struct part *newpp;    /* new part structs */
    register struct sa_red *sared;  /* arg for HD_KREDUCE */
    register int dearjv;        /* PV missing flags have changed */
    register int rollwheel;     /* indicates we should start wheel */
    register int referable;     /* shows a need to re-enable */
    register struct extred_part *pplist; /* ptr to pps to reduce */
    struct part *oldparts[MAXNUMPARTS];  /* ptrs to old part structs */
    register int ppcnt,cpcnt,ix;         /* for loop indexes */
    register struct part *pp, *ppl, *ppnew; /* ptrs to part structs in reduce */
    register short ppnum;       /* pp number used in reduce */
    register int copy,cpymsk,lpmsk,redpps,stlpps,statechg;
                                /* mask variables */

    /*
     * If the VG is closing return error
     */
    if( vg->flags & VG_FORCEDOFF )
        return( FAILURE );

    pb = (struct pbuf *)xmalloc( sizeof(struct pbuf), HD_ALIGN, pinned_heap );
    if( pb == NULL )
        return( FAILURE );

    rc = SUCCESS;

    o_prty = i_disable( INTIODONE );    /* start critical section */

    /*
     * Do what the caller wants
     */
    switch( type ) {

    case HD_KMISSPV:
    case HD_KREMPV:
        /*
         * Assumes that only one PV at a time can be marked as
         * missing/removed.
         */
        pvdel_info = (struct cnfg_pv_del *) arg;

        /*
         * zero out the DALVs LP on this PV
         */
        bzero( pvdel_info->lp_ptr, pvdel_info->lpsize );

        /*
         * Go build a pbuf to give to the SA write routines. This
         * way they do all quorum checking and clean up.
         * (If removing a PV, save the new quorum count in pbuf so
         * hd_sa_cont can update the VG's quorum count right before
         * the quorum is rechecked.)
         */
        hd_bldpbuf( pb, (struct pvol *) pvdel_info->pv_ptr, type,
                    NULL, 0, NULL, NULL );
        if( type == HD_KREMPV ) {
            pb->pb.b_work = pvdel_info->qrmcnt;
            rc = hd_sa_strt( pb, vg, SA_PVREMOVED );
        }
        else
            rc = hd_sa_strt( pb, vg, SA_PVMISSING );
        if( rc == FAILURE )
            break;

        /*
         * If the done flag is on at this point the pbuf has been
         * completed and if we sleep the calling process will hang.
         */
        if( !(pb->pb.b_flags & B_DONE) )
            e_sleep( &(pb->pb.b_event), EVENT_SHORT );

        /*
         * If the error flag is set return FAILURE to the caller
         */
        if( pb->pb.b_flags & B_ERROR )
            rc = FAILURE;

        break;

    case HD_KADDPV:
        /*
         * perform miscellaneous tasks that must be done disabled
         */
        pv_info = (struct cnfg_pv_ins *) arg;
        pv = (struct pvol *)(pv_info->pvol);

        if( vg->pvols[pv_info->pv_idx] == NULL )
            /* set pvol structure pointer for add of a new PV */
            vg->pvols[pv_info->pv_idx] = pv;
        else
            /* copy new pvol data for add of a previously missing PV */
            bcopy( (caddr_t)pv, (caddr_t)vg->pvols[pv_info->pv_idx],
                   sizeof(struct pvol) );

        if( vg->open_count != 0 )
            hd_pvs_opn++;       /* bump number of open PVs */

        /*
         * If we're varying on the VG then return,
         * otherwise initialize the VGSA
         * on this new PV via the WHEEL
         */
        if( vg->flags & VG_OPENING )
            break;

        /*
         * Get control of the wheel if it is rolling.
         */
        if( vg->flags & SA_WHL_ACT )
            e_sleep( &(vg->config_wait), EVENT_SHORT );

        if( vg->flags & VG_FORCEDOFF ) {
            rc = FAILURE;
            break;
        }

        /* update VG's quorum count to include this new PV */
        vg->quorum_cnt = pv_info->qrmcnt;

        /*
         * initialize the SA_SEQ_NUM to a value that will
         * make sure the VGSA on this new PV will be written,
         * and then reset the PV missing flag in the memory
         * copy of the VGSA.
         */
        if( pv->sa_area[0].lsn )
            pv->sa_area[0].sa_seq_num = vg->whl_seq_num - 1;
        if( pv->sa_area[1].lsn )
            pv->sa_area[1].sa_seq_num = vg->whl_seq_num - 1;

        CLRSA_PVMISS( vg->vgsa_ptr, pv->pvnum );

        /*
         * Now force the wheel one revolution. Build a pbuf
         * to give to the wheel, reset the SA holding flag,
         * (re)start the wheel, wait for the wake up to signal
         * the wheel has completed the operation, check status.
         */
        rc = hd_sa_onerev( vg, pb, type );
        break;

    case HD_KEXTEND:
    case HD_KREDUCE:
        /*
         * Get control of the wheel if it is rolling.
         */
        if( vg->flags & SA_WHL_ACT )
            e_sleep( &(vg->config_wait), EVENT_SHORT );

        if( vg->flags & VG_FORCEDOFF ) {
            rc = FAILURE;
            break;
        }

        /*
         * Now that the wheel is ours we can do what needs to be
         * done.
         */
        switch( type ) {
        case HD_KEXTEND:
            /*
             * set up a pointer to the arguments passed in and loop
             * through the cnfg_pp_state structures to process the pps
             * until we come to a ppstate that is CNFG_STOP
             */
            saext = (struct sa_ext *) arg;

            for( ppi = saext->vgsa; (ppi->ppstate != CNFG_STOP); ppi++ ) {
                if( (TSTSA_STLPP(vg->vgsa_ptr, ppi->pvnum, ppi->pp) ? STALEPP :
                     FRESHPP) != ppi->ppstate ) {
                    XORSA_STLPP( vg->vgsa_ptr, ppi->pvnum, ppi->pp );
                    rollwheel = TRUE;
                }
            }

            if( rollwheel == TRUE ) {   /* we changed the VGSA */
                /*
                 * force the wheel one revolution. Build a pbuf to give
                 * to the wheel, reset the SA holding flag, (re)start
                 * the wheel, wait for the wake up to signal that the
                 * wheel has completed the operation, check status.
                 */
                hd_bldpbuf( pb, NULL, type, NULL, 0, NULL, NULL );
                vg->flags &= ~SA_WHL_HLD;
                rc = hd_sa_strt( pb, vg, SA_CONFIGOP );
                if( rc == FAILURE )
                    break;

                /*
                 * If the done flag is on at this point the pbuf has
                 * been completed and if we sleep, the calling process
                 * will hang.
                 */
                if( !(pb->pb.b_flags & B_DONE) )
                    e_sleep( &(pb->pb.b_event), EVENT_SHORT );

                /*
                 * If the error flag is set return FAILURE to the
                 * caller.
                 */
                if( pb->pb.b_flags & B_ERROR ) {
                    rc = FAILURE;
                    break;
                }
            } /* end if rollwheel == TRUE */

            /*
             * call hd_extend( ) to check for resync in progress and to
             * transfer the new lv information to the old lv information
             */
            rc = hd_extend( saext );
            break;
case HD_KREDUCE:

/* set up the needed pointers and variables */
sared = (struct sa_red *) arg;
rollwheel = FALSE;
pplist = sared->list;
for(i = 0; i < MAXNUMPARTS; i++)
    oldparts[i] = sared->lv->parts[i];

/*
 * for the number of physical partitions being reduced, go through
 * the logical partitions and build masks for the pps being
 * reduced, pps that are stale, and the pps that exist; and,
 * check that there are no resyncs in progress. Once the masks
 * are built, go through and check that we aren't reducing the last
 * good copy of the lp. After this, we have finished the validation
 * phase and can then begin the process phase in which we
 * go through and turn on the PP_REDUCING bits and the
 * PP_STALE and PP_CHGING bits in the active pps that are being
 * reduced.
 */

for(ppcnt = 1; ppcnt <= sared->numred; pplist++, ppcnt++) {
    if(pplist->mask != 0) {
        cpymsk = MIRROR_EXIST(sared->lv->nparts);
        lpmsk = stlpps = 0;
        redpps = pplist->mask;
        while(cpymsk != ALL_MIRRORS) {
            copy = FIRST_MIRROR(cpymsk);
            cpymsk |= MIRROR_MASK(copy);
            pp = PARTITION(sared->lv, (pplist->lp_num - 1), copy);
            if(pp->pvol) {
                if(copy == 0) {
                    if(pp->sync_trk != NO_SYNCTRK) {
                        rc = FAILURE;
                        sared->error = CFG_SYNCER;
                        break;
                    }
                }
                lpmsk |= MIRROR_MASK(copy);
                if((pp->ppstate & (PP_STALE | PP_CHGING)) == PP_STALE)
                    stlpps |= MIRROR_MASK(copy);
            } /* end if there is a pvol in this part struct */
        } /* end while */
        if(rc == FAILURE)
            break;

        /*
         * if we're not reducing all of the copies of this lp, check
         * to be sure we're not reducing the last good copy
         */
        if(redpps ^ lpmsk) {
            /* if there are no good copies left */
            if(!((stlpps | redpps) ^ lpmsk)) {
                rc = FAILURE;
                sared->error = CFG_INLPRD;
                break;
            }
        } /* end if redpps ^ lpmsk */
    } /* end if */
} /* end for */
if(rc == FAILURE)
    break;

/* now that we've validated the data, we can proceed */
pplist = sared->list;

for(cpymsk = pplist->mask; cpymsk; cpymsk &= ~MIRROR_MASK(copy)) {
    copy = FIRST_MASK(cpymsk);
    pp = PARTITION(sared->lv, (pplist->lp_num - 1), copy);
    if((pp->ppstate & (PP_STALE | PP_CHGING)) == PP_STALE)
        pp->ppstate |= PP_REDUCING;
    else {
        pp->ppstate |= (PP_STALE | PP_CHGING | PP_REDUCING);
        ppnum = BLK2PART(vg->partshift,
                         (pp->start - pp->pvol->fst_usr_blk));
        SETSA_STLPP(vg->vgsa_ptr, pp->pvol->pvnum, ppnum);
        rollwheel = TRUE;
    }
} /* end for */

/* If we changed the VGSA */
if(rollwheel == TRUE) {

    /*
     * force the wheel one revolution. Build
     * a pbuf to give the wheel, reset the SA
     * holding flag, (re)start the wheel, wait for
     * the wake up to signal the wheel has completed
     * the operation, check status.
     */

    hd_bldpbuf(pb, NULL, type, NULL, 0, NULL, NULL);
    vg->flags &= ~SA_WHL_HLD;
    rc = hd_sa_strt( pb, vg, SA_CONFIGOP );
    if( rc == FAILURE )
        break;

    /*
     * If the done flag is on at this point the
     * pbuf has been completed and if we sleep the
     * calling process will hang.
     */

    if( !(pb->pb.b_flags & B_DONE) )
        e_sleep(&(pb->pb.b_event), EVENT_SHORT);

    /*
     * If the error flag is set return FAILURE to
     * the caller.
     */

    if( pb->pb.b_flags & B_ERROR ) {
        rc = FAILURE;
        break;
    }

    /*
     * if the logical volume is open, then
     * drain the logical volume: wait for all requests currently
     * in the lv work queue to complete
     */

    if(sared->lv->lv_status == LV_OPEN)
hd_quiet(makedev(vg-> major_num,sared-> min_num),vi 
} /* end if rollwheel */
else

    /*
     * if we didn't change the VGSA, then release the inhibit
     * on the wheel and restart it if it was rolling when we
     * started
     */

    if(vg->flags & SA_WHL_HLD) {
        vg->flags &= ~SA_WHL_HLD;
        rc = hd_sa_strt(NULL, vg, SA_CONFIGOP);
        if(rc == FAILURE)
            break;
    }

/* reset the pplist pointer to the beginning of the list */
pplist = sared->list;

/*
 * call hd_reduce( ) to handle promotions and to transfer the
 * new lv information to the old lv information
 */

hd_reduce(sared, vg);
break;

} /* END of switch( type ) */
break;

case HD_KDELPV:

pvdel_info = (struct cnfg_pv_del *) arg;
pv = (struct pvol *)(pvdel_info->pv_ptr);

/* update the VG quorum count -
 * For a PV to be deleted, NO partitions may be allocated,
 * therefore, we don't have to be as careful here as we
 * are with REMOVEPV when we update the quorum count.
 */

vg->quorum_cnt = pvdel_info->qrmcnt;

/*
 * If the wheel is not rolling just remove the pvol pointer
 * from the volgrp structure. If any PV missing flags should
 * be reset then reset them and roll the wheel when ready.
 * If the wheel is rolling things are not so simple. The
 * pvol pointer cannot be jerked out from under the wheel if
 * a request is using it as a stopping point. Therefore,
 * mark the PV missing in the pvol structure, then issue a
 * config request to the wheel forcing the wheel to go
 * one revolution. Since the PV was marked as missing before
 * the config request, it is guaranteed that no request will
 * be using it as a stopping point. It is also guaranteed
 * that the wheel index will not be sitting on any missing PV.
 * Therefore, at this point the pvol pointer can be removed
 * safely.
 */

if( vg->flags & SA_WHL_ACT ) {
    pv->pvstate = PV_MISSING;

    /*
     * Now force the wheel one revolution. Build a pbuf
     * to give the wheel, reset the SA holding flag,
     * (re)start the wheel, wait for the wake up to signal
     * the wheel has completed the operation, check status.
     */

    /* parenthesized so rc receives the return code, not the comparison */
    if ((rc = hd_sa_onerev(vg, pb, type)) != LVDD_SUCCESS)
        break;

} /* END of if the wheel active */

/* zero out the VG's pvol ptr */
vg->pvols[ pv->pvnum ] = NULL;

/*
 * Miscellaneous updates that must be made disabled:
 * delete the DALV's LP on this PV, decrement the global
 * PV open count and update the VG's quorum count
 */

bzero( pvdel_info->lp_ptr, pvdel_info->lpsize );
if( vg->open_count != 0 )
    hd_pvs_opn--;

break;

case HD_KADDVGSA:
case HD_KDELVGSA:

vgsa_info = (struct cnfg_pv_vgsa *) arg;
pv = vgsa_info->pv_ptr;

vg->quorum_cnt = vgsa_info->qrmcnt;
if(type == HD_KADDVGSA) {

    /*
     * ADDING VGSA(s) to this PV - fill in the VGSA LSNs
     * and change the PV's VGSA sequence number so this
     * PV's VGSAs will be written.
     */

    if (vgsa_info->sa_lsns[0]) {
        pv->sa_area[0].lsn = vgsa_info->sa_lsns[0];
        pv->sa_area[0].sa_seq_num = vg->whl_seq_num - 1;
    }
    if (vgsa_info->sa_lsns[1]) {
        pv->sa_area[1].lsn = vgsa_info->sa_lsns[1];
        pv->sa_area[1].sa_seq_num = vg->whl_seq_num - 1;
    }

    /*
     * get control of the wheel and wait for it to run
     * one full revolution
     */

    if( vg->flags & SA_WHL_ACT )
        e_sleep(&(vg->config_wait), EVENT_SHORT);
    if( vg->flags & VG_FORCEDOFF ) {
        rc = FAILURE;
        break;
    }

    rc = hd_sa_onerev(vg, pb, type);
}

else {

    /*
     * DELETING VGSA(s) from this PV - if the wheel is active,
     * get control of it, set the flag for the VGSA(s) being
     * deleted, and then wait for the wheel to run one
     * revolution (the LVDD code that runs the wheel will zero
     * out the VGSA LSN when the nukesa flag is set).
     * If the wheel is NOT active, then just zero out the VGSA
     * LSN's now.
     */

    if( vg->flags & SA_WHL_ACT ) {
        e_sleep(&(vg->config_wait), EVENT_SHORT);
        if( vg->flags & VG_FORCEDOFF ) {
            rc = FAILURE;
            break;
        }

        if (vgsa_info->sa_lsns[0])
            pv->sa_area[0].nukesa = TRUE;
        if (vgsa_info->sa_lsns[1])
            pv->sa_area[1].nukesa = TRUE;

        rc = hd_sa_onerev(vg, pb, type);
    }
    else { /* the wheel is NOT rolling */
        if (vgsa_info->sa_lsns[0])
            pv->sa_area[0].lsn = 0;
        if (vgsa_info->sa_lsns[1])
            pv->sa_area[1].lsn = 0;
    }
}

break;
case HD_MWC_REC:

/*
 * Just update the VGSA:
 * get control of the wheel and wait for it to run
 * one full revolution.
 */

if( vg->flags & SA_WHL_ACT )
    e_sleep(&(vg->config_wait), EVENT_SHORT);
if( vg->flags & VG_FORCEDOFF ) {
    rc = FAILURE;
    break;
}
rc = hd_sa_onerev(vg, pb, type);
break;

default:
    panic("hd_sa_config: unknown request type");
} /* END of switch( type ) */

i_enable(o_prty); /* return to caller priority */

/* Give back the memory we borrowed for the pbuf struct */
assert(xmfree(pb, pinned_heap) == LVDD_SUCCESS);

return( rc );
}


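The validation phase of the HD_KREDUCE case above refuses to remove the last good copy of a logical partition. The mask test can be sketched in isolation as follows. This is only an illustration: the function name `reduce_is_safe` is hypothetical, one bit per mirror copy is assumed, and the comparison is read as an exclusive OR, which is an assumption because the operator is not legible in the printed listing.

```c
#include <assert.h>

/*
 * Sketch only (not the driver's code). One bit per mirror copy.
 *   lpmsk  - copies that exist for the logical partition
 *   stlpps - existing copies that are stale
 *   redpps - copies requested for removal
 * Returns 1 if the reduce may proceed, 0 if it would remove
 * the last good copy while leaving only stale copies behind.
 */
static int reduce_is_safe(unsigned lpmsk, unsigned stlpps, unsigned redpps)
{
    /* assumed precondition: stale and reduced copies must exist */
    assert((stlpps & ~lpmsk) == 0);
    assert((redpps & ~lpmsk) == 0);

    if ((redpps ^ lpmsk) == 0)
        return 1;               /* every copy is being removed */
    if (((stlpps | redpps) ^ lpmsk) == 0)
        return 0;               /* all remaining copies are stale */
    return 1;
}
```

With three copies (mask 7), removing the primary and tertiary (mask 5) while the secondary is stale (mask 2) is rejected, whereas the same request with no stale copies is accepted.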

/*
 * NAME: hd_sa_onerev
 *
 * FUNCTION: Force the WHEEL one revolution to update the VGSA
 *           on all active PVs
 *
 * NOTES:
 *
 * PARAMETERS: vg   - pointer to volume group
 *             pb   - pbuf pointer
 *             type - type of VGSA config operation
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: return code from hd_sa_strt, or FAILURE if the
 *               pbuf error flag is set
 */
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int 

hd_sa_onerev(
register struct volgrp *vg,  /* ptr to volgrp struct */
register struct pbuf *pb,    /* ptr to pbuf struct */
register int type)           /* type of pbuf to build */
{
    register int rc;

    /*
     * Now force the wheel one revolution. Build a pbuf
     * to give the wheel, reset the SA holding flag,
     * (re)start the wheel, wait for the wake up to signal
     * the wheel has completed the operation, check status.
     */

    hd_bldpbuf( pb, NULL, type, NULL, 0, NULL, NULL);
    vg->flags &= ~SA_WHL_HLD;
    rc = hd_sa_strt( pb, vg, SA_CONFIGOP );
    if( rc == FAILURE )
        return(rc);

    /*
     * If the done flag is on at this point the pbuf has been
     * completed and if we sleep the calling process will hang.
     */

    if( !(pb->pb.b_flags & B_DONE) )
        e_sleep(&(pb->pb.b_event), EVENT_SHORT);

    /*
     * If the error flag is set return FAILURE to the caller
     */

    if( pb->pb.b_flags & B_ERROR )
        rc = FAILURE;

    return(rc);
}

/*
 * NAME: hd_bldpbuf
 *
 * FUNCTION: Initialize a pbuf structure for LVDD disk io.
 *
 * NOTES:
 *
 * PARAMETERS: pb      - ptr to pbuf struct
 *             pvol    - target pvol ptr
 *             type    - type of pbuf to build
 *             bufaddr - data buffer address
 *             cnt     - length of buffer
 *             xmem    - ptr to cross memory descriptor
 *             sched   - scheduler I/O done function
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */

void
hd_bldpbuf(
register struct pbuf *pb,      /* ptr to pbuf struct */
register struct pvol *pvol,    /* target pvol ptr */
register int type,             /* type of pbuf to build */
register caddr_t bufaddr,      /* data buffer address - system */
register unsigned cnt,         /* length of buffer */
register struct xmem *xmem,    /* ptr to cross memory descriptor */
register void (*sched)())      /* pointer to function returning void */
{
    register struct buf *lb;   /* ptr to buf struct part of pbuf */

    /*
     * Zero the pbuf then stuff it with the necessary fields
     */

    bzero( pb, sizeof(struct pbuf) );

    lb = (struct buf *)pb;
    if( pvol )
        lb->b_dev = pvol->dev;
    lb->b_baddr = bufaddr;
    lb->b_bcount = cnt;
    lb->b_event = EVENT_NULL;
    if( xmem )
        lb->b_xmemd = *xmem;

    pb->pb_sched = sched;
    pb->pb_pvol = pvol;

    switch(type){

    /* mirror write consistency cache write type */
    case CATYPE_WRT:
        lb->b_iodone = hd_ca_end;
        lb->b_flags = B_BUSY | B_NOHIDE;
        lb->b_blkno = PSN_MWC_REC0;
        break;

    case HD_MWC_REC:
    case HD_KMISSPV:
    case HD_KREMPV:
    case HD_KREDUCE:
    case HD_KEXTEND:
    case HD_KADDPV:
    case HD_KDELPV:
    case HD_KADDVGSA:
    case HD_KDELVGSA:
        lb->b_iodone = NULL;
        lb->b_flags = B_BUSY;
        break;

    default:
        panic("hd_vgsa: unknown pbuf type");
        break;
    }

    return;
}

/*
 * NAME: hd_extend
 *
 * FUNCTION: Transfers old part struct information to new part struct
 *           information.
 *
 * NOTES:
 *
 * PARAMETERS: saext - pointer to information structure for the extend
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: SUCCESS or FAILURE
 */

int
hd_extend(
struct sa_ext *saext)  /* pointer to extend information structure */
{
    register int lpi, cpi;          /* loop counters */
    register int rc;                /* return code */
    register struct part *oldpp;    /* pointer to old part struct */
    register struct part *newpp;    /* pointer to new part struct */

    /*
     * for the old number of logical partitions on the
     * logical volume, go through and search each possible
     * old copy. If the logical partition is not being
     * resynced, put the old part struct information
     * into the new part struct entry
     */

    rc = SUCCESS;

    for(lpi = 0; lpi < saext->old_numlps; lpi++) {
        for(cpi = 0; cpi < saext->old_nparts; cpi++) {
            if(saext->klv_ptr->parts[cpi] != NULL) {
                oldpp = (struct part *)(saext->klv_ptr->parts[cpi] + lpi);
                if(oldpp->pvol != NULL) {
                    if(cpi == 0) {
                        if(oldpp->sync_trk != NO_SYNCTRK) {
                            saext->error = CFG_SYNCER;
                            rc = FAILURE;
                            break;
                        }
                    }
                    newpp = (struct part *)
                        (*(saext->new_parts + cpi) + lpi);
                    *newpp = *oldpp;
                } /* end if oldpp->pvol != NULL */
            } /* end if klv_ptr->parts != NULL */
        } /* end for number of old copies */
        if(rc == FAILURE)
            break;
    } /* end for old number of lps */

    /*
     * if no errors were found, we can complete the
     * extend by filling in the lvol struct with the
     * new info.
     */

    if (rc == SUCCESS) {
        saext->klv_ptr->nparts = saext->nparts;
        saext->klv_ptr->nblocks = saext->nblocks;
        saext->klv_ptr->i_sched = saext->isched;
        for(cpi = 0; cpi < saext->nparts; cpi++)
            saext->klv_ptr->parts[cpi] =
                saext->new_parts[cpi];
    } /* end if rc == SUCCESS */
    return(rc);
}



/*
 * NAME: hd_reduce
 *
 * FUNCTION: Transfers old part struct information to new part struct
 *           information, and handles promotion if needed.
 *
 * NOTES:
 *
 * PARAMETERS: sared - pointer to information structure for the reduce
 *             vg    - pointer to volume group structure
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */

void
hd_reduce(
struct sa_red *sared,  /* pointer to information on the reduce */
struct volgrp *vg)     /* pointer to volume group structure */
{
    register int i, ppcnt, lpcnt, cpcnt;           /* loop counters */
    register struct part *pp, *op, *np, *sp, *tp;  /* part struct pointers */
    register struct extred_part *pplist;  /* pointer to array of ppinfo structs */
    register int ppsleft;         /* mask for pps left after reduction */
    register int copy;            /* holds copy of lp we're processing */
    register int redpps, cpymsk;  /* masks for the logical partition */
    register int zeromsk;         /* mask for copies to zero out */
    register int size;            /* size of old part structs to copy to new */
    struct part zeropp;           /* zeroed out part struct used to zero parts */

    pplist = sared->list;
    bzero((char *)(&zeropp), sizeof(struct part));

    /*
     * go through the pps being reduced and update the old copy as needed.
     * Do the necessary promotions and deletions in the old copy PRIOR to
     * copying things over to the new copy.
     */

    cpymsk = MIRROR_EXIST(sared->lv->nparts);
    for(ppcnt = 1; ppcnt <= sared->numred; pplist++, ppcnt++) {
        if (pplist->mask != 0) {
            redpps = cpymsk | pplist->mask;

            /*
             * NOTE: redpps is a 3 bit field that can have the values
             * 0 (000) - 7 (111). The zero condition cannot exist on a reduce,
             * however.
             */

            switch(redpps) {

            /* promote secondary to primary and tertiary to secondary */
            case 1:
                pp = PARTITION(sared->lv, (pplist->lp_num-1), PRIMMIRROR);
                sp = PARTITION(sared->lv, (pplist->lp_num-1), SINGMIRROR);
                tp = PARTITION(sared->lv, (pplist->lp_num-1), DOUBMIRROR);
                *pp = *sp;
                /*
                 * set up a mask to show the promoted lp
                 * the bits will be off for good copies and on for
                 * the copies that are now invalid.
                 */
                *sp = *tp;
                zeromsk = TERTIARY_MIRROR;
                break;

            /* promote tertiary to secondary */
            case 2:
                sp = PARTITION(sared->lv, (pplist->lp_num-1), SINGMIRROR);
                tp = PARTITION(sared->lv, (pplist->lp_num-1), DOUBMIRROR);
                *sp = *tp;
                zeromsk = TERTIARY_MIRROR;
                break;

            /* promote tertiary to primary */
            case 3:
                pp = PARTITION(sared->lv, (pplist->lp_num-1), PRIMMIRROR);
                tp = PARTITION(sared->lv, (pplist->lp_num-1), DOUBMIRROR);
                *pp = *tp;
                zeromsk = (TERTIARY_MIRROR | SECONDARY_MIRROR);
                break;

            /* no promotion */
            case 4:
            case 6:
            case 7:
                zeromsk = redpps;
                break;

            /* promote secondary to primary */
            case 5:
                pp = PARTITION(sared->lv, (pplist->lp_num-1), PRIMMIRROR);
                sp = PARTITION(sared->lv, (pplist->lp_num-1), SINGMIRROR);
                *pp = *sp;
                zeromsk = (TERTIARY_MIRROR | SECONDARY_MIRROR);
                break;
            } /* end switch */

            /* set up a mask of copies to zero out */
            zeromsk &= ~cpymsk;

            /* zero out the necessary copies of the logical partition */
            while(zeromsk != 0) {
                copy = FIRST_MASK(zeromsk);
                pp = PARTITION(sared->lv, (pplist->lp_num-1), copy);
                *pp = zeropp;
                zeromsk &= ~MIRROR_MASK(copy);
            }
        } /* end if */
    } /* end for ppcnt */

    /* go through and transfer each copy to the new part structure */
    for(cpcnt = 0; cpcnt < sared->nparts; cpcnt++) {
        size = sared->numlps * sizeof(struct part);
        bcopy(sared->lv->parts[cpcnt], sared->newparts[cpcnt], size);
        sared->lv->parts[cpcnt] = sared->newparts[cpcnt];
    }

    /* NULL out the pointers to the copies that no longer exist */
    for(i = sared->nparts; i < sared->lv->nparts; i++)
        sared->lv->parts[i] = NULL;

    /*
     * reset the lvol structure with the values in the extred
     * structure and loop through to put the newparts pointers
     * into the lvol parts field
     */

    sared->lv->nparts = sared->nparts;
    sared->lv->nblocks = PART2BLK(vg->partshift, sared->numlps);
    sared->lv->i_sched = sared->isched;

    return;
}


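The promotion switch in hd_reduce( ) above handles each removal mask separately. Its net effect appears to be that the surviving copies are packed into the lowest mirror slots and the vacated slots are zeroed. A compact sketch of that behaviour is shown here under stated assumptions: copies are modelled as plain integers (0 meaning an empty slot), bit 0 of the mask is taken to be the primary copy, and `promote_copies` is a hypothetical name that does not occur in the listing.

```c
#include <assert.h>
#include <stddef.h>

#define NCOPIES 3   /* primary, secondary, tertiary */

/*
 * Sketch only: compact the copies that are NOT named in redmask into
 * the lowest slots, preserving their order, and zero the rest.
 * Writing at index 'out' is safe because out never exceeds i.
 */
static void promote_copies(int copies[NCOPIES], unsigned redmask)
{
    int i, out = 0;

    assert(copies != NULL);
    for (i = 0; i < NCOPIES; i++) {
        if (!(redmask & (1u << i)))
            copies[out++] = copies[i];
    }
    while (out < NCOPIES)
        copies[out++] = 0;
}
```

For mask 1 this moves the secondary to primary and the tertiary to secondary, and for mask 5 it moves the secondary to primary, which matches the comments on cases 1 and 5 of the switch.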

/* DASD.H */

#ifndef _H_DASD
#define _H_DASD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager - dasd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

/*
 * Logical Volume Manager Device Driver data structures.
 */

#include <sys/types.h>
#include <sys/sleep.h>
#include <sys/lockl.h>
#include <sys/sysmacros.h>
#include <sys/buf.h>
#include <sys/lvdd.h>

/* FIFO queue structure for scheduling logical requests. */
struct hd_queue {           /* queue header structure */
    struct buf *head;       /* oldest request in the queue */
    struct buf *tail;       /* newest request in the queue */
};

struct hd_capvq {           /* queue header structure */
    struct pv_wait *head;   /* oldest request in the queue */
    struct pv_wait *tail;   /* newest request in the queue */
};

/*
 * Structure used by hd_redquiet( ) to mark target PPs for removal.
 * Both are zero relative.
 */

struct hd_lvred {
    long lp;        /* LP the pp belongs to */
    char mirror;    /* mirror number of PP */
};
/*
 * Physical request buf structure.
 *
 * A 'pbuf' is a 'buf' structure with some additional fields used
 * to track the status of the physical requests that correspond to
 * each logical request. A pool of pinned pbufs is allocated and
 * managed by the device driver. The size of this pool depends on
 * the number of open logical volumes.
 */

struct pbuf {

    /* this must come first, 'buf' pointers can be cast to 'pbuf' */
    struct buf pb;              /* imbedded buf for physical driver */

    /* physical buf structure appendage: */
    struct buf *pb_lbuf;        /* corresponding logical buf struct */

    /* scheduler I/O done policy function */
#ifndef _NO_PROTO
    void (*pb_sched) (struct pbuf *);
#else
    void (*pb_sched) ();
#endif

    struct pvol *pb_pvol;       /* physical volume structure */
    struct bad_blk *pb_bad;     /* defects directory entry */
    daddr_t pb_start;           /* starting physical address */
    char pb_mirror;             /* current mirror */
    char pb_miravoid;           /* mirror avoidance mask */
    char pb_mirbad;             /* mask of broken mirrors */
    char pb_mirdone;            /* mask of mirrors done */
    char pb_swretry;            /* number of sw relocation retries */
    char pb_type;               /* Type of pbuf */
    char pb_bbop;               /* BB directory operation */
    char pb_bbstat;             /* status of BB directory operation */
    uchar pb_whl_stop;          /* wheel_idx value when this pbuf is */
                                /* to get off of the wheel */
#ifdef DEBUG
    ushort pb_hw_reloc;         /* Debug - it was a HW reloc request */
    char pad;                   /* pad to full long word */
#else
    char pad[3];                /* pad to full long word */
#endif

    struct part *pb_part;       /* ptr to part structure. Care must */
                                /* be taken when this is used since */
                                /* the parts structure can be moved */
                                /* by hd_config routines while the */
                                /* request is in flight */
    struct unique_id *pb_vgid;  /* volume group ID */

    /* used to dump the allocated pbuf at dump time */
    struct pbuf *pb_forw;       /* forward pointer */
    struct pbuf *pb_back;       /* backward pointer */
};

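The comment in struct pbuf notes that the embedded buf must come first so that 'buf' pointers can be cast to 'pbuf' pointers. The following sketch illustrates why that ordering matters, using two cut-down hypothetical structures (`mini_buf`, `mini_pbuf`) rather than the real kernel types. C guarantees that a pointer to a structure, suitably converted, points to its first member and vice versa.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct buf and struct pbuf. */
struct mini_buf {
    int b_flags;
};

struct mini_pbuf {
    struct mini_buf pb;   /* must be the first member */
    int pb_type;          /* appendage private to the driver */
};

/* Recover the enclosing pbuf from a pointer to its embedded buf. */
static struct mini_pbuf *buf_to_pbuf(struct mini_buf *bp)
{
    /* valid only because 'pb' sits at offset 0 of mini_pbuf */
    assert(offsetof(struct mini_pbuf, pb) == 0);
    return (struct mini_pbuf *)bp;
}
```

A lower-level driver that only knows about the buf can hand the pointer back on completion, and the LVDD-style code can then reach its own fields through the cast.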
#define pb_addr pb.b_un.b_addr  /* too ugly in its raw form */

/* defines for pb_swretry */
#define MAX_SWRETRY 3       /* maximum retries for relocation */
                            /* before declaring disk dead */

/*
 * values for b_work in pbuf struct (since real b_work value only used
 * in lbuf)
 */
#define FIX_READ_ERROR 1    /* fix a previous EMEDIA read error */
#define FIX_ESOFT 2         /* fix a read or write ESOFT error */
#define FIX_EMEDIA 3        /* fix a write EMEDIA error */

/* defines for pb_type */
#define SA_PVMISSING 1      /* PV missing type request */
#define SA_STALEPP 2        /* stale PP type request */
#define SA_FRESHPP 3        /* fresh PP type request */
#define SA_CONFIGOP 4       /* hd_config operation type request */

/*
 * defines to tell hd_bldpbuf what kind of pbuf to build
 * These defines are not the only ones that tell hd_bldpbuf what to
 * build. Check the routine before changing/adding new defines here
 */
#define CATYPE_WRT 1        /* pbuf struct is a cache write type */

/*
 * defines for pb_bbop
 * First set is used by the requests pbuf that is requesting the BB operation.
 * The second set is used in the bb_pbuf to control the action of the
 * actual reading and writing of the BB directory of the PV.
 */
#define BB_ADD 41       /* Add a new bad block entry to BB directory */
#define BB_UPDATE 42    /* Update a bad block entry to BB directory */
#define BB_DELETE 43    /* Delete a bad block entry to BB directory */
#define BB_RDDFCT 44    /* Reading a defective block */
#define BB_WTDFCT 45    /* Writing a defective block */
#define BB_SWRELO 46    /* Software relocation in progress */
#define RD_BBPRIM 70    /* Read the BB primary directory */
#define WT_UBBPRIM 71   /* Write BB prim dir with UPDATE */
#define WT_DBBPRIM 72   /* Rewrite BB prim dir 1st blk with UPDATE */
#define WT_UBBBACK 73   /* Write BB backup dir with UPDATE */
#define WT_DBBBACK 74   /* Rewrite BB back dir 1st blk with UPDATE */

/* defines for pb_berror: 0-63 (good) 64-127 (bad) */
#define BB_SUCCESS 0    /* BBdir updating worked */
#define BB_CRB 1        /* Reloc blkno was changed in this BB entry */
#define BB_ERROR 64     /* Bad Block directories were not updated */
#define BB_FULL 65      /* BBdir is full - no free bad blk entries */



/*
 * Volume group structure.
 *
 * Volume groups are implicitly open when any of their logical volumes are.
 */

#define MAXVGS 255      /* implementation limit on # VGs */
#define MAXLVS 256      /* implementation limit on # LVs */
#define MAXPVS 32       /* implementation limit on number */
                        /* physical volumes per vg */
#define CAHHSIZE 8      /* Number of mwc cache queues */

#define NBPI (NBPB * sizeof(int))   /* Number of bits per int */
#define NBPL (NBPB * sizeof(long))  /* Number of bits per long */



/* macros to set and clear the bits in the opn_pin array */
#define SETLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] |= 1 << ((N)%NBPI))
#define CLRLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] &= ~(1 << ((N)%NBPI)))
#define TSTLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] & 1 << ((N)%NBPI))

/*
 * macros to set and clear the bits in the ca_pv_wrt field
 * NOTE TSTALLPVWRT will not work if max PVs per VG is greater than 32
 */
#define SETPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(N) / NBPI] |= 1 << ((N) % MAXPVS))
#define CLRPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(N) / NBPI] &= ~(1 << ((N) % MAXPVS)))
#define TSTPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(N) / NBPI] & (1 << ((N) % MAXPVS)))
#define TSTALLPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(MAXPVS - 1) / NBPL])

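The SETLVOPN, CLRLVOPN and TSTLVOPN macros above treat an array of ints as a bitmap with one bit per logical volume. A standalone sketch of the same word-index and bit-offset arithmetic follows. It assumes NBPB equals CHAR_BIT and uses unsigned words; the macro names `SETBIT`, `CLRBIT` and `TSTBIT` are hypothetical and are not taken from the header.

```c
#include <limits.h>

/* bits per word, assuming NBPB == CHAR_BIT */
#define NBPI_SKETCH (CHAR_BIT * sizeof(unsigned int))

/* word index is N / bits-per-word, bit offset is N % bits-per-word */
#define SETBIT(A, N) ((A)[(N) / NBPI_SKETCH] |=  (1u << ((N) % NBPI_SKETCH)))
#define CLRBIT(A, N) ((A)[(N) / NBPI_SKETCH] &= ~(1u << ((N) % NBPI_SKETCH)))
#define TSTBIT(A, N) (((A)[(N) / NBPI_SKETCH] >> ((N) % NBPI_SKETCH)) & 1u)

/* enough words to hold 256 bits, rounded up as opn_pin is sized */
#define NWORDS_256 ((256 + (NBPI_SKETCH - 1)) / NBPI_SKETCH)
```

Declaring `unsigned int open_map[NWORDS_256] = {0};` and calling `SETBIT(open_map, 37)` marks entry 37, after which `TSTBIT(open_map, 37)` yields 1 until `CLRBIT(open_map, 37)` is applied.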
/*
 * head of list of varied on volgrp structs in the system
 */

struct
{
    lock_t lock;            /* lock while manipulating list of VG structs */
    struct volgrp *ptr;     /* ptr to list of varied on VG structs */
} hd_vghead = {EVENT_NULL, NULL};



struct volgrp {
    lock_t vg_lock;                 /* lock for all vg structures */
    short pad1;                     /* pad to long word boundary */
    short partshift;                /* log base 2 of part size in blks */
    short open_count;               /* count of open logical volumes */
    ushort flags;                   /* VG flags field */
    ulong tot_io_cnt;               /* number of logical request to VG */
    struct lvol *lvols[MAXLVS];     /* logical volume struct array */
    struct pvol *pvols[MAXPVS];     /* physical volume struct array */
    long major_num;                 /* major number of volume group */
    struct unique_id vg_id;         /* volume group id */
    struct volgrp *nextvg;          /* pointer to next volgrp structure */

    /* Array of bits indicating open LVs */
    /* A bit per LV */
    int opn_pin[(MAXLVS + (NBPI - 1))/NBPI];
    pid_t von_pid;                  /* process ID of the varyon process */

    /* Following used in write consistency cache management */
    struct volgrp *nxtactvg;        /* pointer to next volgrp with */
                                    /* write consistency activity */
    struct pv_wait *ca_freepvw;     /* head of pv_wait free list */
    struct pv_wait *ca_pvwmem;      /* ptr to memory malloced for pvw */
                                    /* free list */
    struct hd_queue ca_hld;         /* head/tail of cache hold queue */
    ulong ca_pv_wrt[(MAXPVS + (NBPL - 1)) / NBPL];
                                    /* when bit set write cache to PV */
    char ca_inflt_cnt;              /* number of PV active writing cache */
    char ca_size;                   /* number of entries in cache */
    ushort ca_pvwblked;             /* number of times the pv_wait free */
                                    /* list has been empty */
    struct mwc_rec *mwc_rec;        /* ptr to part 1 of cache - disk rec */
    struct ca_mwc_mp *ca_part2;     /* ptr to part 2 of cache - memory */
    struct ca_mwc_mp *ca_lst;       /* mru/lru cache list anchor */
    struct ca_mwc_mp *ca_hash[CAHHSIZE]; /* write consistency hash anchors */

    /* the following 2 variables are used to control a cache clean up */
    /* operation. */
    pid_t bcachwait;                /* list waiting at the beginning */
    pid_t ecachwait;                /* list waiting at the end */
    volatile int wait_cnt;          /* count of cleanup waiters */

    /* the following are used to control the VGSAs and the wheel */
    uchar quorum_cnt;               /* Number indicating quorum of SAs */
    uchar wheel_idx;                /* VGSA wheel index into pvols */
    ushort whl_seq_num;             /* VGSA memory image sequence number */
    struct pbuf *sa_act_lst;        /* head of list of pbufs that are */
                                    /* actively on the VGSA wheel */
    struct pbuf *sa_hld_lst;        /* head of list of pbufs that are */
                                    /* waiting to get on the VGSA wheel */
    struct vgsa_area *vgsa_ptr;     /* ptr to in memory copy of VGSA */
    pid_t config_wait;              /* PID of process waiting in the */
                                    /* hd_config routines to modify the */
                                    /* memory version of the VGSA */
    struct buf sa_lbuf;             /* logical buf struct to use to wrt */
                                    /* the VGSAs */
    struct pbuf sa_pbuf;            /* physical buf struct to use to wrt */
                                    /* the VGSAs */
};

/*
 * Defines for flags field in volgrp structure
 */

#define VG_SYSMGMT 0x0002   /* VG is on for system management */
                            /* only commands */
#define VG_FORCEDOFF 0x0004 /* Should only be on when the VG was */
    /* forced varied off and there were LVs still open. Under this con- */
    /* dition the driver entry points can not be deleted from the device */
    /* switch table. Therefore the volgrp structure must be kept */
    /* around to handle any rogue operations on this VG. */
#define VG_OPENING 0x0008   /* VG is being varied on */
#define CA_INFLT 0x0010     /* The cache is being written or */
                            /* locked */
#define CA_VGACT 0x0020     /* This volgrp on mwc active list */
#define CA_HOLD 0x0040      /* Hold the cache in flight */
#define CA_FULL 0x0080      /* Cache is full - no free entries */
#define SA_WHL_ACT 0x0100   /* VGSA wheel is active */
#define SA_WHL_HLD 0x0200   /* VGSA wheel is on hold */
#define SA_WHL_WAIT 0x0400  /* config function is waiting for */
                            /* the wheel to stop */



/*
 * Logical volume structure.
 */

struct lvol {
    struct buf **work_Q;    /* work in progress hash table */
    short lv_status;        /* lv status: closed, closing, open */
    short lv_options;       /* logical dev options (see below) */
    short nparts;           /* num of part structures for this */
                            /* lv - base 1 */
    char i_sched;           /* initial scheduler policy state */
    char pad;               /* padding so data word aligned */
    ulong nblocks;          /* LV length in blocks */
    struct part *parts[3];  /* partition arrays for each mirror */
    ulong tot_wrts;         /* total number of writes to LV */
    ulong tot_rds;          /* total number of reads to LV */

    /*
     * These fields of the lvol structure are read and/or written by
     * the bottom half of the LVDD; and therefore must be carefully
     * modified.
     */
    int cmplcnt;            /* completion count - used to quiesce */
    int waitlist;           /* event list for quiesce of LV */
};

/* lv status: */
#define LV_CLOSED 0         /* logical volume is closed */
#define LV_CLOSING 1        /* trying to close the LV */
#define LV_OPEN 2           /* logical volume is open */

/* scheduling policies: */
#define SCH_REGULAR 0       /* regular, non_mirrored LV */
#define SCH_SEQUENTIAL 1    /* sequential write, seq read */
#define SCH_PARALLEL 2      /* parallel write, read closest */
#define SCH_SEQWRTPARRD 3   /* sequential write, read closest */
#define SCH_PARWRTSEQRD 4   /* parallel write, seq read */

/* logical device options: */
#define LV_NOBBREL 0x0010   /* no bad block relocation */
#define LV_RDONLY 0x0020    /* read-only logical volume */
#define LV_DMPINPRG 0x0040  /* Dump in progress to this LV */
#define LV_DMPDEV 0x0080    /* This LV is a DUMP device */
                            /* DUMPINIT has been done */
#define LV_NOMWC 0x0100     /* no mirror write consistency */
                            /* checking */
#define LV_WRITEV WRITEV    /* Write verify writes in LV */

/* work_Q hash algorithm - just a stub now */
#define HD_HASH(Lb) \
    (BLK2TRK((Lb)->b_blkno) & (WORKQ_SIZE-1))



/*
 * Partition structure.
 */

struct part {
    struct pvol *pvol;  /* containing physical volume */
    daddr_t start;      /* starting physical disk address */
    short sync_trk;     /* current LTG being resynced */
    char ppstate;       /* physical partition state */
    char sync_msk;      /* current LTG sync mask */
};
/*
 * Physical partition state defines (PP_) and structure defines.
 * The PP_STALE and PP_REDUCING bits could be combined into one but it
 * is easier to understand if they are not and a problem arises later.
 * The PP_RIP bit is only valid in the primary part structure.
 */

#define PP_STALE 0x01       /* Set when PP is stale */
#define PP_CHGING 0x02      /* Set when PP is stale but the */
                            /* VGSAs have not been completely */
                            /* updated yet */
#define PP_REDUCING 0x04    /* Set when PP is in the process */
                            /* of being removed (reduced out) */
#define PP_RIP 0x08         /* Set when a Resync is in progress */
                            /* When set "sync_trk" indicates */
                            /* the track being synced. If */
                            /* sync_trk not == -1 and PP_RIP */
                            /* not set sync_trk is next trk */
                            /* to be synced */
#define PP_SYNCERR 0x10     /* Set when error in a partition */
                            /* being resynced. Causes the */
                            /* partition to remain stale. */
#define NO_SYNCTRK -1       /* The LP does not have a resync */
                            /* in progress */

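The state bits above are combined in the reduce path of hd_sa_config, where a partition that is already stale and fully recorded in the VGSAs only needs PP_REDUCING, while any other partition must also be marked stale and changing so that the VGSA is rewritten. A minimal sketch of that decision follows, reusing the three bit values from the header. The helper name `mark_reducing` and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PP_STALE    0x01
#define PP_CHGING   0x02
#define PP_REDUCING 0x04

/*
 * Sketch only: mark one partition as being reduced.
 * Returns 1 when the VGSA must be updated (the "wheel" must roll),
 * 0 when the partition was already recorded as stale.
 */
static int mark_reducing(char *ppstate)
{
    assert(ppstate != NULL);

    /* stale with no VGSA update pending: nothing new to record */
    if ((*ppstate & (PP_STALE | PP_CHGING)) == PP_STALE) {
        *ppstate |= PP_REDUCING;
        return 0;
    }
    *ppstate |= (PP_STALE | PP_CHGING | PP_REDUCING);
    return 1;
}
```

The masked comparison is what distinguishes "stale and settled" from "stale but still changing"; testing PP_STALE alone would skip a VGSA update that is still required.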
/*
 * Physical volume structure.
 * Contains defects directory hash anchor table. The defects
 * directory is hashed by track group within partition. Entries within
 * each congruence class are sorted in ascending block addresses.
 * This scheme doesn't quite work, yet. The congruence classes need
 * to be aligned with logical track groups or partitions to guarantee
 * that all blocks of this request are checked. But physical addresses
 * need not be aligned on track group boundaries.
 */

#define HASHSIZE 64     /* number of defect hash classes */

struct defect_tbl {
    struct bad_blk *defects[HASHSIZE];  /* defect directory anchor */
};

struct pvol {
    dev_t dev;                  /* dev_t of physical device */
    daddr_t armpos;             /* last requested arm position */
    short xfcnt;                /* transfer count for this pv */
    short pvstate;              /* PV state */
    short pvnum;                /* LVM PV number 0-31 */
    short vg_num;               /* VG major number */
    struct file *fp;            /* file pointer from open of PV */
    char flags;                 /* place to hold flags */
    char pad;                   /* unused */
    short num_bbdir_ent;        /* current number of BB Dir entries */
    daddr_t fst_usr_blk;        /* first available block on the PV */
                                /* for user data */
    daddr_t beg_relblk;         /* first blkno in reloc pool */
    daddr_t next_relblk;        /* blkno of next unused relocation */
                                /* block in reloc blk pool at end */
                                /* of pv */
    daddr_t max_relblk;         /* largest blkno avail for reloc */
    struct defect_tbl *defect_tbl;  /* pointer to defect table */
    struct hd_capvq ca_pv;      /* head/tail of queue of request */
                                /* waiting for cache write to */
                                /* complete */
    struct sa_pv_whl {          /* VGSA information for this PV */
        daddr_t lsn;            /* SA logical sector number - LV 0 */
        ushort sa_seq_num;      /* SA wheel sequence number */
        char nukesa;            /* flag set if SA to be deleted */
        char pad;               /* pad to full long word */
    } sa_area[2];               /* one for each possible SA on PV */
    struct pbuf pv_pbuf;        /* pbuf struct for writing cache */
};

/* defines for pvstate field */

#define PV_MISSING  1   /* PV cannot be accessed */
#define PV_RORELOC  2   /* No HW or SW relocation allowed */
                        /* only known bad blks relocated */

/*
 * returns index into the bad block hash table for this block number
 */
#define BBHASH_IND(blkno) (BLK2TRK(blkno) & (HASHSIZE - 1))

/*
 * Macro to return defect directory congruence class pointer
 */
#define HASH_BAD(Pb,Bad_blkno) \
    ((Pb)->pb_pvol->defect_tbl->defects[BLK2TRK(Bad_blkno)&(HASHSIZE-1)])

/*
 * Used by the LVM dump device routines same as HASH_BAD but the first
 * argument is a pvol struct pointer
 */
#define HASH_BAD_DMP(Pvol,Blkno) \
    ((Pvol)->defect_tbl->defects[BLK2TRK(Blkno)&(HASHSIZE-1)])
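The defect directory is hashed by logical track group, and each congruence class is kept in ascending block order. The sketch below illustrates a lookup and a sorted insert under those two rules. The `_sketch` structures and the helpers `find_defect` and `add_defect` are hypothetical simplifications, not routines from the listing; the shift of 8 is TRKSHIFT + BPPGSHIFT as defined later in the same header.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSIZE 64
#define LTGSHIFT 8      /* TRKSHIFT (5) + BPPGSHIFT (3) from the listing */
#define BLK2TRK(Blk) ((unsigned)(Blk) >> LTGSHIFT)

/* Simplified stand-ins: only the fields needed for the hash chains. */
struct bad_blk_sketch {
    struct bad_blk_sketch *next;    /* next entry in congruence class */
    long blkno;                     /* bad physical disk address */
};

struct defect_tbl_sketch {
    struct bad_blk_sketch *defects[HASHSIZE];
};

/* Hypothetical helper: insert an entry, keeping its class in ascending order. */
static void add_defect(struct defect_tbl_sketch *t, struct bad_blk_sketch *nb)
{
    struct bad_blk_sketch **pp =
        &t->defects[BLK2TRK(nb->blkno) & (HASHSIZE - 1)];

    while (*pp != NULL && (*pp)->blkno < nb->blkno)
        pp = &(*pp)->next;
    nb->next = *pp;
    *pp = nb;
}

/* Hypothetical helper: return the entry for blkno, or NULL if none exists. */
static struct bad_blk_sketch *find_defect(struct defect_tbl_sketch *t, long blkno)
{
    struct bad_blk_sketch *bb = t->defects[BLK2TRK(blkno) & (HASHSIZE - 1)];

    assert(blkno >= 0);
    while (bb != NULL && bb->blkno < blkno)     /* sorted: can stop early */
        bb = bb->next;
    return (bb != NULL && bb->blkno == blkno) ? bb : NULL;
}
```

Because the chain is sorted, a search for a block that is not in the directory ends at the first entry with a larger address instead of walking the whole class.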

/*
 * Bad block directory entry.
 */
struct bad_blk {                /* bad block directory entry */
    struct bad_blk *next;       /* next entry in congruence class */
    dev_t    dev;               /* containing physical device */
    daddr_t  blkno;             /* bad physical disk address */
    unsigned status: 4;         /* relocation status (see below) */
    unsigned relblk: 28;        /* relocated physical disk address */
};

/* bad block relocation status values: */
#define REL_DONE    0   /* software relocation completed */
#define REL_PENDING 1   /* software relocation in progress */
#define REL_DEVICE  2   /* device (HW) relocation requested */
#define REL_CHAINED 3   /* relocation blk structure exists */
#define REL_DESIRED 8   /* relocation desired - hi order bit on */

/*
 * Macros for getting and releasing bad block structures from the
 * pool of bad_blk structures. They are linked together by their next pointers.
 * "hd_freebad" points to the head of bad_blk free list
 *
 * NOTE: Code must check if hd_freebad != null before calling
 * the GET_BBLK macro.
 */

#define GET_BBLK(Bad) { \
    (Bad) = hd_freebad; \
    hd_freebad = hd_freebad->next; \
    hd_freebad_cnt--; \
}

#define REL_BBLK(Bad) { \
    (Bad)->next = hd_freebad; \
    hd_freebad = (Bad); \
    hd_freebad_cnt++; \
}
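The note above requires callers to test hd_freebad before invoking GET_BBLK. A minimal sketch of that calling pattern follows. The one-field `bad_blk` and the wrapper `get_bblk_checked` are assumptions made for illustration; the two macros and the two globals mirror the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced to the link field; the real structure carries more members. */
struct bad_blk {
    struct bad_blk *next;
};

static struct bad_blk *hd_freebad = NULL;   /* head of free list */
static int hd_freebad_cnt = 0;              /* entries on free list */

#define GET_BBLK(Bad) { \
    (Bad) = hd_freebad; \
    hd_freebad = hd_freebad->next; \
    hd_freebad_cnt--; \
}

#define REL_BBLK(Bad) { \
    (Bad)->next = hd_freebad; \
    hd_freebad = (Bad); \
    hd_freebad_cnt++; \
}

/* Hypothetical wrapper: performs the NULL check the NOTE asks for. */
static struct bad_blk *get_bblk_checked(void)
{
    struct bad_blk *bb = NULL;

    if (hd_freebad != NULL)
        GET_BBLK(bb);
    return bb;      /* NULL means the pool is empty */
}
```

The list behaves as a stack: the structure released last is the one handed out next.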

/*
 * Macros for accessing these data structures.
 */
#define VG_DEV2LV(Vg, Dev)
#define VG_DEV2PV(Vg, Pnum)

#define BLK2PART(Pshift, Lbn)
#define PART2BLK(Pshift, P_no)
#define PARTITION(Lv, P_no, Mir)

/*
 * Mirror bit definitions
 */
#define PRIMARY_MIRROR      001     /* primary mirror mask */
#define SECONDARY_MIRROR    002     /* secondary mirror mask */
#define TERTIARY_MIRROR     004     /* tertiary mirror mask */
#define ALL_MIRRORS         007     /* mask of all mirror bits */

/* macro to extract mirror avoidance mask from ext parameter */
#define X_AVOID(Ext) ( ((Ext) >> AVOID_SHFT) & ALL_MIRRORS )

/*
 * Macros to select mirrors using avoidance masks:
 * FIRST_MIRROR returns first unmasked mirror (0 to 2); 3 if all masked
 * FIRST_MASK returns first masked mirror (0 to 2); 3 if none masked
 * MIRROR_COUNT returns number of unmasked mirrors (0 to 3)
 * MIRROR_MASK returns a mask to avoid a specific mirror (1, 2, 4)
 * MIRROR_EXIST returns a mask for non-existent mirrors (0, 4, 6, or 7)
 */
#define FIRST_MIRROR(Mask)      ((0x30102010 >> ((Mask) << 2)) & 0x0f)
#define FIRST_MASK(Mask)        ((0x01020103 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)      ((0x01121223 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_EXIST(Nmirrors)  ((0x00000467 >> ((Nmirrors) << 2)) & 0x0f)
#define MIRROR_MASK(Mirror)     (1 << (Mirror))
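Each mirror selection macro packs an eight-entry lookup table into a 32-bit constant, one 4-bit nibble per possible avoidance mask (0 to 7); shifting by four times the mask brings the wanted nibble to the bottom. The sketch below checks the packed tables against straightforward loop versions. The reference functions and `check_mirror_tables` are hypothetical test scaffolding, not part of the listing.

```c
#include <assert.h>

#define FIRST_MIRROR(Mask)      ((0x30102010 >> ((Mask) << 2)) & 0x0f)
#define FIRST_MASK(Mask)        ((0x01020103 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)      ((0x01121223 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_EXIST(Nmirrors)  ((0x00000467 >> ((Nmirrors) << 2)) & 0x0f)
#define MIRROR_MASK(Mirror)     (1 << (Mirror))

/* Reference: lowest mirror whose bit has the given value; 3 if none. */
static int first_with_bit(int mask, int want_set)
{
    int m;

    for (m = 0; m < 3; m++)
        if (((mask >> m) & 1) == want_set)
            return m;
    return 3;
}

/* Reference: number of mirrors whose bit is clear in the mask. */
static int count_unmasked(int mask)
{
    int m, n = 0;

    for (m = 0; m < 3; m++)
        if (!(mask & MIRROR_MASK(m)))
            n++;
    return n;
}

/* Returns 1 if every packed table entry matches its loop reference. */
static int check_mirror_tables(void)
{
    int mask;

    for (mask = 0; mask < 8; mask++) {
        if (FIRST_MIRROR(mask) != first_with_bit(mask, 0)) return 0;
        if (FIRST_MASK(mask)   != first_with_bit(mask, 1)) return 0;
        if (MIRROR_COUNT(mask) != count_unmasked(mask))    return 0;
    }
    return 1;
}
```

For example, an avoidance mask of 3 (primary and secondary masked) selects nibble 3 of 0x30102010, which is 2, the tertiary mirror.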

/*
 * DBSIZE and DBSHIFT were originally UBSIZE and UBSHIFT from param.h.
 * They were renamed and moved to here to more closely resemble a disk
 * block and not a user block size.
 */
#define DBSIZE      512     /* Disk block size in bytes */
#define DBSHIFT     9       /* log 2 of DBSIZE */

/*
 * LVPAGESIZE and LVPGSHIFT were originally PAGESIZE and PGSHIFT from param.h.
 * They were renamed and moved to here to isolate LVM from the changeable
 * system parameters that would have undesirable effects on LVM functionality.
 */
#define LVPAGESIZE  4096    /* Page size in bytes */
#define LVPGSHIFT   12      /* log 2 of LVPAGESIZE */

#define BPPG        (LVPAGESIZE/DBSIZE)     /* blocks per page */
#define BPPGSHIFT   (LVPGSHIFT-DBSHIFT)     /* log 2 of BPPG */
#define PGPTRK      32                      /* pages per logical track group */
#define TRKSHIFT    5                       /* log base 2 of PGPTRK */
#define LTGSHIFT    (TRKSHIFT+BPPGSHIFT)    /* logical track group log base 2 */
#define BYTEPTRK    PGPTRK*LVPAGESIZE       /* bytes per logical track group */
#define BLKPTRK     PGPTRK*BPPG             /* blocks per logical track group */

#define SIGNED_SHFTMSK 0x80000000   /* signed mask for shifting to */
                                    /* get page affected mask */

#define BLK2BYTE(Nblocks)   ((unsigned)(Nblocks) << (DBSHIFT))
#define BYTE2BLK(Nbytes)    ((unsigned)(Nbytes) >> (DBSHIFT))
#define BLK2PG(Blk)         ((unsigned)(Blk) >> BPPGSHIFT)
#define PG2BLK(Pageno)      ((Pageno) << (LVPGSHIFT-DBSHIFT))
#define BLK2TRK(Blk)        ((unsigned)(Blk) >> (TRKSHIFT+BPPGSHIFT))
#define TRK2BLK(T_no)       ((unsigned)(T_no) << (TRKSHIFT+BPPGSHIFT))
#define PG2TRK(Pageno)      ((unsigned)(Pageno) >> TRKSHIFT)

/* LTG per partition */
#define TRKPPART(Pshift)    ((unsigned)(1 << (Pshift - LTGSHIFT)))
/* LTG in the partition */
#define TRK_IN_PART(Pshift,Blk) (BLK2TRK(Blk) & (TRKPPART(Pshift) - 1))
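With 512-byte blocks, 4096-byte pages and 32 pages per logical track group (LTG), one LTG spans 256 blocks, so every conversion above reduces to a shift. The sketch below works one block address through the macros. The structure `blk_loc`, the function `locate_block`, and the reading of Pshift as log 2 of the number of blocks per partition are assumptions made for illustration.

```c
#include <assert.h>

#define DBSHIFT     9
#define LVPGSHIFT   12
#define BPPGSHIFT   (LVPGSHIFT-DBSHIFT)
#define TRKSHIFT    5
#define LTGSHIFT    (TRKSHIFT+BPPGSHIFT)

#define BLK2PG(Blk)         ((unsigned)(Blk) >> BPPGSHIFT)
#define BLK2TRK(Blk)        ((unsigned)(Blk) >> (TRKSHIFT+BPPGSHIFT))
#define TRK2BLK(T_no)       ((unsigned)(T_no) << (TRKSHIFT+BPPGSHIFT))
#define TRKPPART(Pshift)    ((unsigned)(1 << (Pshift - LTGSHIFT)))
#define TRK_IN_PART(Pshift,Blk) (BLK2TRK(Blk) & (TRKPPART(Pshift) - 1))

/* Hypothetical result record for one block address. */
struct blk_loc {
    unsigned page;          /* page number containing the block */
    unsigned ltg;           /* logical track group number */
    unsigned ltg_in_part;   /* LTG index within its partition */
    unsigned blk_in_ltg;    /* block offset within the LTG */
};

/* pshift is assumed to be log 2 of the blocks per partition. */
static struct blk_loc locate_block(unsigned blkno, int pshift)
{
    struct blk_loc loc;

    assert(pshift >= LTGSHIFT);
    loc.page        = BLK2PG(blkno);
    loc.ltg         = BLK2TRK(blkno);
    loc.ltg_in_part = TRK_IN_PART(pshift, blkno);
    loc.blk_in_ltg  = blkno - TRK2BLK(loc.ltg);
    return loc;
}
```

Under the stated assumption, a Pshift of 13 describes a partition of 8192 blocks (4 MB) holding 32 LTGs, and block 8965 falls in page 1120, LTG 35, which is LTG 3 of the second partition, at offset 5.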



/* defines for top half of LVDD */
#define LVDD_HFREE_BB   30      /* high water mark for kernel bad_blk struct */
#define LVDD_LFREE_BB   15      /* low water mark for kernel bad_blk struct */
#define WORKQ_SIZE      64      /* size of LVs work in progress queue */
#define PBSUBPOOLSIZE   16      /* size of pbuf subpool alloc'd by PVs */
#define HD_ALIGN        (uint)0 /* align characteristics for alloc'd memory */
#define FULL_WORDMASK   3       /* mask for full word (log base 2) */
#define BUFCNT          3       /* parameter sent to uphysio for # buf */
                                /* to allocate */

#define NOMIRROR        0       /* no mirrors */
#define PRIMMIRROR      0       /* primary mirror */
#define SINGMIRROR      1       /* one mirror */
#define DOUBMIRROR      2       /* two mirrors */

#define MAXNUMPARTS     3       /* maximum number of parts in a logical part */
#define PVNUMVGDAS      2       /* max number of VGDA/VGSAs on a PV */

/* return codes for LVDD top 1/2 */
#define LVDD_SUCCESS    0       /* general success code */
#define LVDD_ERROR      -1      /* general error code */
#define LVDD_NOALLOC    -200    /* hd_init: not able to allocate pool of bufs */

#endif /* _H_DASD */



/* HD.H */

#ifndef _H_HD
#define _H_HD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - hd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 *
 */

#include <sys/errids.h>

/*
 * LVDD internal macros and extern statically declared variables.
 */



/* LVM internal defines: */
#define FAILURE     0   /* must be logic FALSE for 'if' tests */
#define SUCCESS     1   /* must be logic TRUE for 'if' tests */
#define MAXGRABLV   16  /* Max number of LVs to grab pbuf structs */
#define MAXSYSVG    3   /* Max number of VGs to grab pbuf structs */
#define CAHEAD      1   /* move cache entry to head of use list */
#define CATAIL      2   /* move cache entry to tail of use list */
#define CA_MISS     0   /* MWC cache miss */
#define CA_HIT      1   /* MWC cache hit */
#define CA_LBHOLD   2   /* The logical request should hold */



/*
 * Following defines are used to communicate with the kernel process
 */
#define LVDD_KP_TERM    0x80000000  /* Terminate the kernel process */
#define LVDD_KP_BADBLK  0x40000000  /* Need more bad_blk structs */
#define LVDD_KP_ACTMSK  0xC0000000  /* Mask of all events */

/*
 * Following defines are used in the b_options of the logical buf struct.
 * They should be reserved in lvdd.h in relationship to the ext parameters
 */
#define REQ_IN_CACH     0x40000000  /* When set in the lbuf b_options */
                                    /* the request is in the mirror */
                                    /* write consistency cache */
#define REQ_VGSA        0x20000000  /* When set in the lbuf b_options */
                                    /* it means this is a VGSA write */
                                    /* and to use the special sa_pbuf */
                                    /* in the volgrp structure */

/*
 * The following variables are only used in the kernel and therefore are
 * only included if the _KERNEL variable is defined.
 */

#ifdef _KERNEL
#include <sys/syspest.h>

/*
 * Set up a debug level if debug turned on
 */
#ifdef DEBUG
#ifdef LVDD_PHYS
BUGVDEF(debugM, 0)
#else
BUGXDEF(debugM)
#endif
#endif

/*
 * pending queue
 *
 * This is the primary data structure for passing work from
 * the strategy routines (see hd_strat.c) to the scheduler
 * (see hd_sched.c) via the mirror write consistency logic.
 * From this queue the request will go to one of three other
 * queues.
 *
 * 1. cache hold queue - If the request involves mirrors
 *    and the write consistency cache is in flight,
 *    i.e. being written to PVs.
 *
 * 2. cache PV queue - If the request must wait for the
 *    write consistency cache to be written to the PV.
 *
 * 3. schedule queue - Requests are scheduled from this
 *    queue.
 *
 * This queue is only changed within a device driver critical section.
 */
#ifdef LVDD_PHYS
struct hd_queue pending_Q;
#else
extern struct hd_queue pending_Q;
#endif
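The three destinations of a request leaving the pending queue can be summarised as a small decision function. This is only a sketch of the routing rule as the comment states it: the enum, the function name and the three boolean inputs are hypothetical, and the actual driver derives these conditions from the logical buf and the volume group's cache state.

```c
#include <assert.h>

/* Hypothetical labels for the three queues named in the comment. */
enum next_queue { Q_CACHE_HOLD, Q_CACHE_PV, Q_SCHEDULE };

/*
 * involves_mirrors:  the request touches a mirrored logical partition
 * cache_in_flight:   the write consistency cache is being written to PVs
 * needs_cache_write: the request must wait for the cache to reach the PV
 */
static enum next_queue route_request(int involves_mirrors,
                                     int cache_in_flight,
                                     int needs_cache_write)
{
    if (involves_mirrors && cache_in_flight)
        return Q_CACHE_HOLD;    /* wait until the in-flight cache write ends */
    if (needs_cache_write)
        return Q_CACHE_PV;      /* wait for the cache write to this PV */
    return Q_SCHEDULE;          /* ready to be scheduled */
}
```

A request to an unmirrored logical volume never waits on the cache in this sketch and goes directly to the schedule queue.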

/*
 * ready queue -- physical requests that are ready to start.
 *
 * This queue is only valid within a single critical section.
 * It really contains a list of pbufs, but only the imbedded
 * buf struct is of interest at this point. Since the pointers
 * are of type (struct buf *) it is convenient that the queue be
 * declared similarly.
 */
#ifdef LVDD_PHYS
struct buf *ready_Q = NULL;
#else
extern struct buf *ready_Q;
#endif

/*
 * Chain of free and available pbuf structs.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_freebuf = NULL;
#else
extern struct pbuf *hd_freebuf;
#endif

/*
 * Chain of pbuf structs currently allocated and pinned for LVDD use.
 * Only used at dump time and by crash to find them.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_dmpbuf = NULL;
#else
extern struct pbuf *hd_dmpbuf;
#endif



/*
 * Chain and count of free and available bad_blk structs.
 * The first open of a VG, really the first open of an LV, will cause
 * LVDD_HFREE_BB (currently 30) bad_blk structs to be allocated and
 * chained here. After that when the count gets to LVDD_LFREE_BB (low
 * water mark, currently 15) the kernel process will be kicked to go
 * get more, up to LVDD_HFREE_BB (high water mark).
 *
 * *NOTE* hd_freebad_lk is a lock mechanism to keep the top half of the
 *        driver and the kernel process from colliding. This would only
 *        happen if the last request before the last LV closed received
 *        an ESOFT or EMEDIA (and request was a write) and the getting of
 *        a bad_blk struct caused the count to go below the low water
 *        mark. This would result in the kproc trying to put more
 *        structures on the list while hd_close via hd_frefrebb would
 *        be removing them.
 */
#ifdef LVDD_PHYS
int             hd_freebad_lk = LOCK_AVAIL;
struct bad_blk  *hd_freebad = NULL;
int             hd_freebad_cnt = 0;
#else
extern int              hd_freebad_lk;
extern struct bad_blk   *hd_freebad;
extern int              hd_freebad_cnt;
#endif

/*
 * Chain of volgrp structs that have write consistency caches that need
 * to be written to PVs. This chain is used so all incoming requests
 * can be scanned before putting the write consistency cache in flight.
 * Once in flight the cache is locked out and any new requests will have
 * to wait for all cache writes to finish.
 */
#ifdef LVDD_PHYS
struct volgrp *hd_vg_mwc = NULL;
#else
extern struct volgrp *hd_vg_mwc;
#endif

/*
 * The following arrays are used to allocate mirror write consistency
 * caches in a group of 8 per page. This is due to the way the hide
 * mechanism works only on page quantities. These two arrays should be
 * treated as being in lock step. The lock, hd_ca_lock, is used to
 * ensure only one process is playing with the arrays at any one time.
 */
#define VGS_CA ((MAXVGS + (NBPB-1))/NBPB)
#ifdef LVDD_PHYS
lock_t          hd_ca_lock = LOCK_AVAIL;    /* lock for cache arrays */
char            ca_alloced[VGS_CA];         /* bit per VG with cache allocated */
struct mwc_rec  *ca_grp_ptr[VGS_CA];        /* 1 for each 8 VGs */
#else
extern lock_t           hd_ca_lock;
extern char             ca_alloced[];
extern struct mwc_rec   *ca_grp_ptr[];
#endif

/*
 * The following variables are used to control the number of pbuf
 * structures allocated for LVM use. It is based on the number of
 * PVs in varied on VGs. The first PV gets 64 structures and each
 * PV thereafter gets 16 more. The number is reduced only when a
 * VG goes inactive, i.e. all its LVs are closed.
 */
#ifdef LVDD_PHYS
int hd_pbuf_cnt = 0;                    /* Total Number of pbufs allocated */
int hd_pbuf_grab = PBSUBPOOLSIZE;       /* Number of pbuf structs to allocate */
                                        /* for each active PV on the system */
int hd_pbuf_min = PBSUBPOOLSIZE * 4;    /* Number of pbuf to allocate for the */
                                        /* first PV on the system */
int hd_vgs_opn = 0;                     /* Number of VGs opened */
int hd_lvs_opn = 0;                     /* Number of LVs opened */
int hd_pvs_opn = 0;                     /* Number of PVs in varied on VGs */
int hd_pbuf_inuse = 0;                  /* Number of pbufs currently in use */
int hd_pbuf_maxuse = 0;                 /* Maximum number of pbufs in use during */
                                        /* this boot */
#else
extern int hd_pbuf_cnt;
extern int hd_pbuf_grab;
extern int hd_pbuf_min;
extern int hd_vgs_opn;
extern int hd_lvs_opn;
extern int hd_pvs_opn;
extern int hd_pbuf_inuse;
extern int hd_pbuf_maxuse;
#endif
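The sizing rule in the comment (64 pbufs for the first PV, 16 for each additional one) follows directly from hd_pbuf_min and hd_pbuf_grab. A minimal sketch of that arithmetic is shown below; `pbufs_for_pvs` is a hypothetical helper and is not one of the driver's allocation routines.

```c
#include <assert.h>

#define PBSUBPOOLSIZE 16

/* Hypothetical helper: pbufs expected for npvs PVs in varied on VGs. */
static int pbufs_for_pvs(int npvs)
{
    int hd_pbuf_grab = PBSUBPOOLSIZE;       /* per additional PV */
    int hd_pbuf_min  = PBSUBPOOLSIZE * 4;   /* for the first PV */

    assert(npvs >= 0);
    if (npvs == 0)
        return 0;
    return hd_pbuf_min + (npvs - 1) * hd_pbuf_grab;
}
```

With three PVs varied on, this gives 64 + 2 * 16 = 96 pbuf structures.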



/*
 * The following are used to update the bad block directory on a disk
 */
#ifdef LVDD_PHYS
struct pbuf     *bb_pbuf;   /* ptr to pbuf reserved for BB dir updating */
struct hd_queue bb_hld;     /* holding Q used when there is a BB */
                            /* directory update in progress */
#else
extern struct pbuf      *bb_pbuf;
extern struct hd_queue  bb_hld;
#endif

/*
 * The following variables are used to communicate between the LVDD
 * and the kernel process.
 */
#ifdef LVDD_PHYS
pid_t hd_kpid = 0;          /* PID of the kernel process */
#else
extern pid_t hd_kpid;
#endif

/*
 * The following variables are used in an attempt to keep some information
 * around about the performance and potential bottlenecks in the driver.
 * Currently these must be looked at with crash or the kernel debugger.
 */
#ifdef LVDD_PHYS
ulong hd_pendqblked = 0;    /* How many times the scheduling queue */
                            /* (pending_Q) has been blocked due to no */
                            /* pbufs being available. */
#else
extern ulong hd_pendqblked;
#endif

/*
 * The following are used to log error messages by LVDD. The de_data
 * is defined as a general 16 byte array, BUT, its actual use is
 * totally dependent on the error type.
 */
#define RESRC_NAME "LVDD"       /* Resource name for error logging */
struct hd_errlog_ent {          /* Error log entry structure */
    struct err_rec0 id;
    char            de_data[16];
};

/* macros to allocate and free pbuf structures */
#define GET_PBUF(PB) { \
    (PB) = hd_freebuf; \
    hd_freebuf = (struct pbuf *) hd_freebuf->pb.av_forw; \
    hd_pbuf_inuse++; \
    if (hd_pbuf_inuse > hd_pbuf_maxuse) \
        hd_pbuf_maxuse = hd_pbuf_inuse; \
}

#define REL_PBUF(PB) { \
    (PB)->pb.av_forw = (struct buf *) hd_freebuf; \
    hd_freebuf = (PB); \
    hd_pbuf_inuse--; \
}

/* macros to allocate and free pv_wait structures */
#define GET_PVWAIT(Pvw, Vg) { \
    (Pvw) = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw)->nxt_pv_wait; \
}

#define REL_PVWAIT(Pvw, Vg) { \
    (Pvw)->nxt_pv_wait = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw); \
}

#define TST_PVWAIT(Vg) ((Vg)->ca_freepvw == NULL)

/*
 * Macro to put volgrp ptr at head of the list of VGs waiting to start
 * MWC cache writes
 */
#define CA_VG_WRT(Vg) { \
    if (!((Vg)->flags & CA_VGACT)) \
    { \
        (Vg)->nxtactvg = hd_vg_mwc; \
        hd_vg_mwc = (Vg); \
        (Vg)->flags |= CA_VGACT; \
    } \
}

/*
 * Macro to determine if a physical request should be returned to
 * the scheduling layer or continue (resume).
 */
#define PB_CONT(Pb) { \
    if (((Pb)->pb_addr == ((Pb)->pb_lbuf->b_baddr + (Pb)->pb_lbuf->b_bcount)) || \
        ((Pb)->pb.b_flags & B_ERROR)) \
        HD_SCHED((Pb)); \
    else \
        hd_resume((Pb)); \
}

/*
 * HD_SCHED -- invoke scheduler policy routine for this request
 *
 * For physical requests it invokes the physical operation end policy.
 */
#define HD_SCHED(Pb) (*(Pb)->pb_sched)(Pb)

/* define for b_error value (only used by LVDD) */
#define ELBBLOCKED 255      /* this logical request is blocked by */
                            /* another one in progress */

#endif /* _KERNEL */

/*
 * Write consistency cache structures and macros
 */

/* cache hash algorithms - returns index into cache hash table */
#define CA_HASH(Lb)     (BLK2TRK((Lb)->b_blkno) & (CAHHSIZE-1))
#define CA_THASH(Trk)   ((Trk) & (CAHHSIZE-1))

/*
 * This structure will generally be referred to as part 2 of the cache
 */
struct ca_mwc_mp {      /* cache mirror write consistency memory only part */
    struct ca_mwc_mp *hq_next;  /* ptr to next hash queue entry */
    char    state;              /* State of entry */
    char    pad1;               /* Pad to word */
    ushort  iocnt;              /* Non-zero - io active to LTG */
    struct ca_mwc_dp *part1;    /* Ptr to part 1 entry - ca_mwc_dp */
    struct ca_mwc_mp *next;     /* Next memory part struct */
    struct ca_mwc_mp *prev;     /* Previous memory part struct */
};

/* ca_mwc_mp state defines */
#define CANOCHG 0x00    /* Cache entry has NOT changed since last */
                        /* cache write operation, but is on a hash */
                        /* queue somewhere */
#define CACHG   0x01    /* Cache entry has changed since last cache */
                        /* write operation */
#define CACLEAN 0x02    /* Cache entry has not been used since last */
                        /* clean up operation */

/*
 * This structure will generally be referred to as part 1 of the cache
 * In order to stay long word aligned this structure has a 2 byte pad.
 * This reduces the number of cache entries available in the cache.
 */
struct ca_mwc_dp {      /* cache mirror write consistency disk part */
    ulong   lv_ltg;     /* LV logical track group */
    ushort  lv_minor;   /* LV minor number */
    short   pad;
};

#define MAX_CA_ENT 62   /* Max number that will fit in block */

/*
 * This structure must be maintained to be 1 block in length (512 bytes).
 * This also implies the maximum number of write consistency cache entries.
 */
struct mwc_rec {        /* mirror write consistency disk record */
    struct timestruc_t b_tmstamp;           /* Time stamp at beginning of block */
    struct ca_mwc_dp   ca_p1[MAX_CA_ENT];   /* Reserve 62 part 1 structures */
    struct timestruc_t e_tmstamp;           /* Time stamp at end of block */
};

/*
 * This structure is used by the MWCM. It is hung on the PV cache write
 * queues to indicate which lbufs are waiting on any particular PV. The
 * define controls how much memory to allocate to hold these structures.
 * The algorithm is 3 * CA_MULT * cache size * size of structure.
 */
#define CA_MULT 4       /* pv_wait * cache size multiplier */

struct pv_wait {
    struct pv_wait  *nxt_pv_wait;   /* next pv_wait structure on chain */
    struct buf      *lb_wait;       /* ptr to lbuf waiting for cache */
};
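The requirement that the mirror write consistency record be exactly one 512-byte block explains the limit of 62 entries: two 8-byte time stamps plus 62 entries of 8 bytes each give 16 + 496 = 512. The sketch below checks that arithmetic at compile time. The `_sketch` types use fixed-width integers and assume an 8-byte time stamp of two 32-bit fields; those layouts are assumptions about the original 32-bit platform, not definitions from the listing.

```c
#include <assert.h>
#include <stdint.h>

#define DBSIZE      512
#define MAX_CA_ENT  62

/* Assumed layout: seconds and nanoseconds as 32-bit fields (8 bytes). */
struct timestruc_sketch {
    uint32_t tv_sec;
    uint32_t tv_nsec;
};

/* Disk part of a cache entry: 4 + 2 + 2 = 8 bytes. */
struct ca_mwc_dp_sketch {
    uint32_t lv_ltg;    /* LV logical track group */
    uint16_t lv_minor;  /* LV minor number */
    int16_t  pad;       /* keeps the entry long word aligned */
};

struct mwc_rec_sketch {
    struct timestruc_sketch b_tmstamp;
    struct ca_mwc_dp_sketch ca_p1[MAX_CA_ENT];
    struct timestruc_sketch e_tmstamp;
};

/* Compile-time check: the array size is negative if the record is not 512 bytes. */
typedef char mwc_rec_is_one_block[
    (sizeof(struct mwc_rec_sketch) == DBSIZE) ? 1 : -1];

/* Run-time accessor so the size can also be inspected by a caller. */
static unsigned long mwc_rec_size(void)
{
    return (unsigned long)sizeof(struct mwc_rec_sketch);
}
```

Without the 2-byte pad each entry would need only 6 bytes, which is the trade-off the comment on part 1 of the cache refers to.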
/*
 * LVM function declarations - arranged by module in order by how they occur
 * in said module.
 */

#ifdef _KERNEL
#ifndef _NO_PROTO

/* hd_mircach.c */

extern int hd_ca_ckcach (
    register struct buf    *lb,     /* current logical buf struct */
    register struct volgrp *vg,     /* ptr to volgrp structure */
    register struct lvol   *lv);    /* ptr to lvol structure */

extern void hd_ca_use (
    register struct volgrp    *vg,      /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent,  /* cache entry pointer */
    register int              h_t);     /* head/tail flag */

extern struct ca_mwc_mp *hd_ca_new (
    register struct volgrp *vg);    /* ptr to volgrp structure */

extern void hd_ca_wrt (void);

extern void hd_ca_wend (
    register struct pbuf *pb);      /* Address of pbuf completed */

extern void hd_ca_sked (
    register struct volgrp *vg,     /* ptr to volgrp structure */
    register struct pvol   *pvol);  /* pvol ptr for this PV */

extern struct ca_mwc_mp *hd_ca_fnd (
    register struct volgrp *vg,     /* ptr to volgrp structure */
    register struct buf    *lb);    /* ptr to lbuf to find the entry */
                                    /* for */

extern void hd_ca_clnup (
    register struct volgrp *vg);    /* ptr to volgrp structure */

extern void hd_ca_qunlk (
    register struct volgrp    *vg,      /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent); /* ptr to entry to unlink */

extern int hd_ca_pvque (
    register struct buf    *lb,     /* current logical buf struct */
    register struct volgrp *vg,     /* ptr to volgrp structure */
    register struct lvol   *lv);    /* ptr to lvol structure */

extern void hd_ca_end (
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_ca_term (
    register struct buf *lb);       /* current logical buf struct */

extern void hd_ca_mvhld (
    register struct volgrp *vg);    /* ptr to volgrp structure */

/* hd_dump.c */

extern int hd_dump (
    dev_t      dev,     /* major/minor of LV */
    struct uio *uiop,   /* ptr to uio struct describing operation */
    int        cmd,     /* dump command */
    char       *arg,    /* cmd dependent - ptr to dmp_query struct */
    int        chan,    /* not used */
    int        ext);    /* not used */

extern int hd_dmpxlate (
    register dev_t         dev,     /* major/minor of LV */
    register struct uio    *luiop,  /* ptr to logical uio structure */
    register struct volgrp *vg);    /* ptr to VG from device switch table */

/* hd_top.c */

extern int hd_open (
    dev_t dev,      /* device number major.minor of LV to be opened */
    int   flags,    /* read/write flag */
    int   chan,     /* not used */
    int   ext);     /* not used */

extern int hd_allocpbuf(void);

extern void hd_pbufdmpq(
    register struct pbuf *pb,       /* new pbuf for chain */
    register struct pbuf **qq);     /* Ptr to queue anchor */

extern void hd_openbkout(
    int bopoint,                    /* point to start backing out */


    struct volgrp *vg);             /* struct volgrp ptr */

extern void hd_backout(
    int           bopoint,  /* point where error occurred & need to */
                            /* backout all structures pinned before */
                            /* this point */
    struct lvol   *lv,      /* ptr to lvol to backout */
    struct volgrp *vg);     /* struct volgrp ptr */

extern int hd_close(
    dev_t dev,      /* device number major.minor of LV to be closed */
    int   chan,     /* not used */
    int   ext);     /* not used */

extern void hd_vgcleanup(
    struct volgrp *vg);     /* struct volgrp ptr */

extern void hd_frefrebb(void);

extern int hd_allocbblk(void);

extern int hd_read(
    dev_t      dev,     /* num major.minor of LV to be read */
    struct uio *uiop,   /* pointer to uio structure that specifies */
                        /* location & length of caller's data buffer */
    int        chan,    /* not used */
    int        ext);    /* extension parameters */

extern int hd_write(
    dev_t      dev,     /* num major.minor of LV to be written */
    struct uio *uiop,   /* pointer to uio structure that specifies */
                        /* location & length of caller's data buffer */
    int        chan,    /* not used */
    int        ext);    /* extension parameters */

extern int hd_mincnt(
    struct buf *bp,         /* ptr to buf struct to be checked */
    void       *minparms);  /* ptr to ext value sent to uphysio by */
                            /* hd_read/hd_write. */

extern int hd_ioctl(
    dev_t dev,      /* device number major.minor of LV to be opened */
    int   cmd,      /* specific ioctl command to be performed */
    int   arg,      /* addr of parameter blk for the specific cmd */
    int   mode,     /* request origination */
    int   chan,     /* not used */
    int   ext);     /* not used */

extern struct mwc_rec *hd_alloca(void);

extern void hd_dealloca(
    register struct mwc_rec *ca_ptr);   /* ptr to cache to free */

extern void hd_nodumpvg(
    struct volgrp *);

/* hd_phys.c */

extern void hd_begin(
    register struct pbuf   *pb,     /* physical device buf struct */
    register struct volgrp *vg);    /* pointer to volgrp struct */

extern void hd_end(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_resume(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_ready(
    register struct pbuf *pb);      /* physical request buf */

extern void hd_start(void);

extern void hd_gettime(
    register struct timestruc_t *o_time);   /* old time */

/* hd_bbrel.c */

extern int hd_chkblk(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_bbend(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_baddone(
    register struct pbuf *pb);      /* physical request to process */

extern void hd_badblk(
    register struct pbuf *pb);      /* physical request to process */

extern void hd_swreloc(
    register struct pbuf *pb);      /* physical request to process */

extern daddr_t hd_assignalt(
    register struct pbuf *pb);      /* physical request to process */

extern struct bad_blk *hd_fndbbrel(
    register struct pbuf *pb);      /* physical request to process */

extern void hd_nqbblk(
    register struct pbuf *pb);      /* physical request to process */

extern void hd_dqbblk(
    register struct pbuf *pb,       /* physical request to process */
    register daddr_t     blkno);

/* hd_sched.c */

extern void hd_schedule(void);

extern int hd_avoid(
    register struct buf    *lb,     /* logical request buf */
    register struct volgrp *vg);    /* VG volgrp ptr */

extern void hd_resyncpp(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_freshpp(
    register struct volgrp *vg,     /* pointer to volgrp struct */
    register struct pbuf   *pb);    /* physical request buf */

extern void hd_mirread(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_fixup(
    register struct pbuf *pb);      /* physical device buf struct */

extern void hd_stalepp(
    register struct volgrp *vg,     /* pointer to volgrp struct */
    register struct pbuf   *pb);    /* physical device buf struct */

extern void hd_staleppe(
    register struct pbuf *pb);      /* physical request buf */

extern void hd_xlate(
    register struct pbuf   *pb,     /* physical request buf */
    register int           mirror,  /* mirror number */
    register struct volgrp *vg);    /* VG volgrp ptr */

extern int hd_regular(
    register struct buf    *lb,     /* logical request buf */
    register struct volgrp *vg);    /* volume group structure */

extern void hd_finished(
    register struct pbuf *pb);      /* physical device buf struct */

extern int hd_sequential(
    register struct buf    *lb,     /* logical request buf */
    register struct volgrp *vg);    /* volume group structure */

extern int hd_seqnext(
    register struct pbuf   *pb,     /* physical request buf */
    register struct volgrp *vg);    /* VG volgrp pointer */

extern void hd_seqwrite(
    register struct pbuf *pb);      /* physical device buf struct */

extern int hd_parallel(
    register struct buf    *lb,     /* logical request buf */
    register struct volgrp *vg);    /* volume group structure */

extern void hd_freeall(
    register struct pbuf *q);       /* write request queue */

extern void hd_append(
    register struct pbuf *pb,       /* physical request pbuf */
    register struct pbuf **qq);     /* Ptr to write request queue anchor */

extern void hd_nearby(
    register struct pbuf   *pb,     /* physical request pbuf */
    register struct buf    *lb,     /* logical request buf */
    register int           mask,    /* mirrors to avoid */
    register struct volgrp *vg,     /* volume group structure */
    register struct lvol   *lv);

extern void hd_parwrite(
    register struct pbuf *pb);      /* physical device buf struct */

/* hd_strat.c */

extern void hd_strategy(
    register struct buf *lb);       /* input list of logical buf structs */

extern void hd_initiate(
    register struct buf *lb);       /* input list of logical bufs */

extern struct buf *hd_reject(
    struct buf *lb,                 /* offending buf structure */
    int        errno);              /* error number */

extern void hd_quiescevg(
    struct volgrp *vg);             /* pointer from device switch table */

extern void hd_quiet(
    dev_t         dev,              /* number major.minor of LV to quiesce */
    struct volgrp *vg);             /* ptr from device switch table */

extern void hd_redquiet(
    dev_t           dev,            /* number major.minor of LV */
    struct hd_lvred *red_lst);      /* ptr to list of PPs to remove */

extern int hd_add2pool(
    register struct pbuf *subpool,  /* ptr to pbuf sub pool */
    register struct pbuf *dmpq);    /* ptr to pbuf dump queue */

extern void hd_deallocpbuf(void);

extern int hd_numpbufs(void);

extern void hd_terminate(
    register struct buf *lb);       /* logical buf struct */

extern void hd_unblock(
    register struct buf *next,      /* first request on hash chain */
    register struct buf *lb);       /* logical request to reschedule */

extern void hd_quelb (
    register struct buf      *lb,   /* current logical buf struct */
    register struct hd_queue *que); /* queue structure ptr */

extern int hd_kdis_initmwc(
    struct volgrp *vg);             /* volume group pointer */

extern int hd_kdis_dswadd(
    register dev_t        device,   /* device number of the VG */
    register struct devsw *devsw);  /* address of the devsw entry */

extern int hd_kdis_chgqrm(
    struct volgrp *vg,              /* volume group pointer */
    short         newqrm);          /* new quorum count */

extern int hd_kproc(void);

1* hd_vgsa.c 7 

extern int hd_sa_strt( 

register struct pbuf *pb, f physical device but struct */ 

register struct volgrp *vg, f volgrp pointer 7 

register int type); /* type of request 7 

extern void hd_sa_wrt( 

register struct volgrp *vg);/* volgrp pointer 7 

extern void hd_sa_iodone( 

register struct but *lb); /* ptr to Ibuf in VG just completed 7 

extern void hd_sa_cont( 

register struct volgrp *vg, /* volgrp pointer 7 
register int sajjpdated); f ptr to Ibuf in VG just completed 7 

extern void hd_sa_hback( 

register struct pbuf *head jjtr, I* head of pbuf list 7 
register struct pbuf *new_pbuf); t ptr to pbuf to append to list 7 

extern void hd_sa_rtn( 

register struct pbuf *head _ptr, I* head of pbuf list 7 
register int err fig); f if true return requests with 7 

/* ENXIO error 7 

extern int hd_sa_whladv( 

register struct volgrp *vg, /* volgrp pointer 7 
register int c_whljdx); I* current wheel index 7 

extern void hd__sa_update( 

register struct volgrp *vg); f* volgrp pointer 7 
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extern int hd_sa_qrmchk( 

register struct volgrp *vg);/* volgrp pointer */ 

extern int hd_sa_config( 

register struct volgrp *vg /* volgrp pointer */ 

register int type, /* type of hdjconfig request 7 

register caddrj arg); /* ptr to arguments for the request 7 

extern int hd_sa_onerev( 

register struct volgrp *vg, I* volgrp pointer V 

register struct pbuf *pv, t ptr pbuf structure 7 

register int type); /* type of hdjconfig request 7 

extern void hd_bldpbuf(
    register struct pbuf *pb, /* ptr to pbuf struct */
    register struct pvol *pvol, /* target pvol ptr */
    register int type, /* type of pbuf to build */
    register caddr_t bufaddr, /* data buffer address - system */
    register unsigned cnt, /* length of buffer */
    register struct xmem *xmem, /* ptr to cross memory descriptor */
    register void (*sched)()); /* ptr to function ret void */

extern int hd_extend(
    register sa_ext *saext); /* ptr to structure with extend info */

extern void hd_reduce(
    struct sa_red *sared, /* ptr to structure with reduce info */
    struct volgrp *vg); /* ptr to volume group structure */

/* hd_bbdir.c */

extern void hd_upd_bbdir(
    register struct pbuf *pb); /* physical request to process */

extern void hd_bbdirend(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern void hd_bbdirop(void);
extern int hd_bbadd(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern int hd_bbdel(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern int hd_bbupd(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern void hd_chkbbhld(void);

extern void hd_bbdirdone(
    register struct pbuf *origpb); /* physical request to process */

extern void hd_logerr(
    register unsigned id, /* original request to process */
    register ulong dev, /* device number */
    register ulong arg1,
    register ulong arg2);

#else 

/* See above for description of call arguments */

/* hd_mircach.c */

extern int hd_ca_ckcach();

extern void hd_ca_use();

extern struct ca_mwc_mp *hd_ca_new();

extern void hd_ca_wrt();

extern void hd_ca_wend();

extern void hd_ca_sked();

extern struct ca_mwc_mp *hd_ca_fnd();

extern void hd_ca_clnup();

extern void hd_ca_qunlk();

extern int hd_ca_pvque();

extern void hd_ca_end();
extern void hd_ca_term();

extern void hd_ca_mvhld();

/* hd_dump.c */

extern int hd_dump();

extern int hd_dmpxlate();

/* hd_top.c */

extern int hd_open();

extern int hd_allocpbuf();

extern void hd_pbufdmpq();

extern void hd_openbkout();

extern void hd_backout();

extern int hd_close();

extern int hd_vgcleanup();

extern void hd_frefrebb();

extern int hd_allocbblk();

extern int hd_read();

extern int hd_write();

extern int hd_mincnt();

extern int hd_ioctl();

extern struct mwc_rec *hd_alloca();

extern void hd_dealloca();

extern void hd_nodumpvg();

/* hd_phys.c */

extern void hd_begin();

extern void hd_end();

extern void hd_resume();

extern void hd_ready();

extern void hd_start();

extern void hd_gettime();

/* hd_bbrel.c */
extern int hd_chkblk();

extern void hd_bbend();

extern void hd_baddone();

extern void hd_badblk();

extern void hd_swreloc();

extern daddr_t hd_assignalt();

extern struct bad_blk *hd_fndbbrel();

extern void hd_nqbblk();

extern void hd_dqbblk();

/* hd_sched.c */

extern void hd_schedule();

extern int hd_avoid();

extern void hd_resyncpp();

extern void hd_freshpp();

extern void hd_mirread();

extern void hd_fixup();

extern void hd_stalepp();

extern void hd_staleppe();

extern void hd_xlate();

extern int hd_regular();

extern void hd_finished();

extern int hd_sequential();

extern int hd_seqnext();

extern void hd_seqwrite();

extern int hd_parallel();

extern void hd_freeall();

extern void hd_append();

extern void hd_nearby();

extern void hd_parwrite();

/* hd_strat.c */

extern void hd_strategy();

extern void hd_initiate();

extern struct buf *hd_reject();

extern void hd_quiescevg();
extern void hd_sa_update();


extern int 
extern int
extern void
extern void hd_upd_bbdir();

extern void hd_bbdirend();

extern void hd_bbdirop();

extern int hd_bbadd();

extern int hd_bbdel();
extern int hd_bbupd();

extern void hd_chk_bbhld();

extern void hd_bbdirdone();

extern void hd_logerr();

#endif /* _NO_PROTO */

#endif /* _KERNEL */

#endif /* _H_HD */